Closed joshfactorial closed 2 years ago
2 things to check:
Here is the output after 18 hours on an HPC:
Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index /projects/bioinformatics/jallen17/Reference/Homo_sapiens_assembly38.fasta.fai
Warning: Reference contains sequences not found in targeted regions BED file.
reading chr1...
29107.422 (sec)
--------------------------------
sampling reads...
[----------]
Read sampling completed in 6589 (sec)
Writing output VCF...
reading chr2...
4387.950 (sec)
--------------------------------
sampling reads...
[----------]
Read sampling completed in 5419 (sec)
Writing output VCF...
reading chr3...
3583.005 (sec)
--------------------------------
sampling reads...
[----------]
Read sampling completed in 4546 (sec)
Writing output VCF...
reading chr4...
1646.746 (sec)
--------------------------------
sampling reads...
[----------]
Read sampling completed in 3484 (sec)
Writing output VCF...
reading chr5...
6101.071 (sec)
--------------------------------
sampling reads...
[---
So you can see the timing involved. It takes about 8 hours on this machine to read Chr1, which is of course the biggest. Subsequent chroms are a little faster.
I am unable to replicate the hg19 bug. I would need to know the exact command and file used to see if I could reproduce it.
Hi Josh,
Sorry for the late reply!! Anyway Thanks for re report about the hg38 run. I didn't know it takes so long to run, but of course I'll do some other test. But the problem was that after 8 hours the files were still empty, so I'm not sure that more time could solve it. For hg19 i used this command
python gen_reads.py -r ../../pipelines/genomes/hg19_analysisSet/hg19.analysisSet.fa -R 101 -o ../../pipelines/data/simulated_data/test_3 --vcf --pe 300 30 -tr ../../pipelines/genomes/geneAnnotations/hg19.exome.Twist.bed -c 50 -d
and I used this reference
As far as empty files goes, this is expected, as NEAT opens files for writing long before it has data to write to the files. This is a section of the code that's currently under development. You can check that NEAT is still working using the command 'top' on the command line. You should see 'python' near the top of the machine running the processing (on an HPC cluster, you may need to ssh into the worker node first). If it is stalled out, then the CPU usage will drop off to 0. In the mean time I'll try your command on my end.
Thank you very much. I'll try :)
I found a couple of bugs, actually related to the debug flag (-d), when I ran this command. It's up and running now, though I could not find a free version of the bed file you are using, so I'm using one from Broad. If you clone this repo or pull for the latest changes, let's see if that clears up the problem. Otherwise, kindly paste in the complete output from NEAT or upload as a file. This will greatly assist in debugging. Thanks!
There was also a bug where NEAT was trying to look at commented header lines from the bed, which I also fixed.
Describe the bug Hi there, I just downloaded the latest release (compatible with python 3) and after the installation of all requirements, I tried to run gen_reads.py. But even if I don't get any error (only a warning regarding the bed file Warning: Reference contains sequences not found in targeted regions BED file. but I saw it's quite normal, and I know reference matches bed file because I used them in the germline pipeline) the output files are still empty after 7-8 hours)
This is my env
python==3.8.8 (I tried also with python 3.8.5 and 3.6 but it doesn't work) numpy==1.19.5 biopython==1.78 matplotlib==3.3.4 matplotlib_venn==0.11.6 pandas==1.2.1 pysam==0.16.0.1
This is my command using hg38 as ref
python gen_reads.py -r ../../pipelines/genomes/hg38_analysisSet/hg38.analysisSet.fa -R 101 -o ../../pipelines/data/simulated_data/test_1 --vcf --pe 300 30 -tr ../../pipelines/genomes/geneAnnotations/hg38.exome.pad20nts.ncbiRefSeq.bed -c 50
This is my log
Using default sequencing error model. Using default gc-bias model. Using artificial fragment length distribution. mean=300, std=30 found index ../../pipelines/genomes/hg38_analysisSet/hg38.analysisSet.fa.fai Warning: Reference contains sequences not found in targeted regions BED file. reading chr1...
I'm running it on Ubuntu 20.04
While when I use hg19 I get this error [Traceback (most recent call last): File "gen_reads.py", line 902, in main() File "gen_reads.py", line 549, in main print(f'PROCESSING WINDOW: {(start, end), [buffer_added]}, ' UnboundLocalError: local variable 'buffer_added' referenced before assignment
thanks :)
NOTE: I'm reproducing this bug from Zach's repository. It was originally submitted by @joys8998