ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

HG19 input crashes NEAT #4

Closed: joshfactorial closed this issue 2 years ago

joshfactorial commented 3 years ago

Describe the bug
Hi there, I just downloaded the latest release (compatible with Python 3) and, after installing all requirements, I tried to run gen_reads.py. I don't get any error, only a warning about the BED file (Warning: Reference contains sequences not found in targeted regions BED file.), which I understand is quite normal, and I know the reference matches the BED file because I used both in the germline pipeline. But the output files are still empty after 7-8 hours.

This is my env

python==3.8.8 (I also tried Python 3.8.5 and 3.6, but it doesn't work)
numpy==1.19.5
biopython==1.78
matplotlib==3.3.4
matplotlib_venn==0.11.6
pandas==1.2.1
pysam==0.16.0.1

This is my command, using hg38 as the reference:

python gen_reads.py -r ../../pipelines/genomes/hg38_analysisSet/hg38.analysisSet.fa -R 101 -o ../../pipelines/data/simulated_data/test_1 --vcf --pe 300 30 -tr ../../pipelines/genomes/geneAnnotations/hg38.exome.pad20nts.ncbiRefSeq.bed -c 50

This is my log

Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index ../../pipelines/genomes/hg38_analysisSet/hg38.analysisSet.fa.fai
Warning: Reference contains sequences not found in targeted regions BED file.
reading chr1...

I'm running it on Ubuntu 20.04

When I use hg19 instead, I get this error:

Traceback (most recent call last):
  File "gen_reads.py", line 902, in <module>
    main()
  File "gen_reads.py", line 549, in main
    print(f'PROCESSING WINDOW: {(start, end), [buffer_added]}, '
UnboundLocalError: local variable 'buffer_added' referenced before assignment
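
For context, this kind of UnboundLocalError typically means a variable is assigned only inside a conditional branch and then used unconditionally afterwards. Here is a minimal sketch of that pattern (not NEAT's actual code; the branch condition and values are made up):

def process_window(start, end, in_targeted_region):
    # buffer_added is assigned only inside this branch...
    if in_targeted_region:
        buffer_added = 50
    # ...so if the branch is skipped, the next line raises
    # UnboundLocalError: local variable 'buffer_added' referenced before assignment
    print(f'PROCESSING WINDOW: {(start, end), [buffer_added]}')

# The usual fix is to give the variable a default before the conditional,
# e.g. buffer_added = 0 at the top of the function.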

thanks :)

NOTE: I'm reproducing this bug from Zach's repository. It was originally submitted by @joys8998

joshfactorial commented 3 years ago

Two things to check:

  1. Whether the first command eventually produces data, and the files were still empty only because of NEAT's slowness.
  2. Try to recreate the bug when using hg19. I'll assume the command is otherwise the same as the original command.
joshfactorial commented 3 years ago

Here is the output after 18 hours on an HPC:

 Using default sequencing error model.
Using default gc-bias model.
Using artificial fragment length distribution. mean=300, std=30
found index /projects/bioinformatics/jallen17/Reference/Homo_sapiens_assembly38.fasta.fai
Warning: Reference contains sequences not found in targeted regions BED file.
reading chr1...
29107.422 (sec)
--------------------------------
sampling reads...
[----------]
Read sampling completed in 6589 (sec)
Writing output VCF...
reading chr2...
4387.950 (sec)
--------------------------------
sampling reads...
[----------]
Read sampling completed in 5419 (sec)
Writing output VCF...
reading chr3...
3583.005 (sec)
--------------------------------
sampling reads...
[----------]
Read sampling completed in 4546 (sec)
Writing output VCF...
reading chr4...
1646.746 (sec)
--------------------------------
sampling reads...
[----------]
Read sampling completed in 3484 (sec)
Writing output VCF...
reading chr5...
6101.071 (sec)
--------------------------------
sampling reads...
[---

So you can see the timing involved. It takes about 8 hours on this machine to read Chr1, which is of course the biggest. Subsequent chroms are a little faster.

joshfactorial commented 3 years ago

I am unable to replicate the hg19 bug. I would need to know the exact command and file used to see if I could reproduce it.

joys8998 commented 3 years ago

Hi Josh,

Sorry for the late reply! Thanks for the report about the hg38 run. I didn't know it takes so long to run, but of course I'll do some more tests. The problem, though, was that after 8 hours the files were still empty, so I'm not sure that more time alone would solve it. For hg19 I used this command

python gen_reads.py -r ../../pipelines/genomes/hg19_analysisSet/hg19.analysisSet.fa -R 101 -o ../../pipelines/data/simulated_data/test_3 --vcf --pe 300 30 -tr ../../pipelines/genomes/geneAnnotations/hg19.exome.Twist.bed -c 50 -d

and I used this reference

joshfactorial commented 3 years ago

As far as empty files go, this is expected: NEAT opens files for writing long before it has data to write to them. This is a section of the code that's currently under development. You can check that NEAT is still working using the command 'top' on the command line. You should see 'python' near the top of the process list on the machine running the processing (on an HPC cluster, you may need to ssh into the worker node first). If it has stalled out, the CPU usage will drop off to 0. In the meantime I'll try your command on my end.
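
To illustrate why the output files can sit at 0 bytes for hours, here is a minimal sketch of the behavior described above (not NEAT's actual code; the file name is made up):

# Output handles are opened up front, so the files exist on disk immediately with size 0.
out_fastq = open("test_1_read1.fq", "w")

# ... hours of simulation work happen before any reads are generated ...

# Data only shows up on disk once it is actually written (and flushed).
out_fastq.write("@read1\nACGTACGT\n+\nIIIIIIII\n")
out_fastq.flush()
out_fastq.close()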

joys8998 commented 3 years ago

Thank you very much. I'll try :)

joshfactorial commented 3 years ago

I found a couple of bugs, actually related to the debug flag (-d), when I ran this command. It's up and running now, though I could not find a free version of the BED file you are using, so I'm using one from Broad. If you clone this repo or pull the latest changes, let's see if that clears up the problem. Otherwise, please paste in the complete output from NEAT or upload it as a file; that will greatly assist in debugging. Thanks!

joshfactorial commented 3 years ago

There was also a bug where NEAT was trying to parse commented header lines from the BED file, which I have also fixed.
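
For reference, here is a minimal sketch of the kind of guard that handles this when parsing a BED file (illustrative only, not the actual fix in NEAT):

def read_bed_regions(bed_path):
    """Yield (chrom, start, end) tuples from a BED file, skipping header/comment lines."""
    with open(bed_path) as bed:
        for line in bed:
            line = line.strip()
            # Skip blank lines and common BED header/comment lines.
            if not line or line.startswith(('#', 'track', 'browser')):
                continue
            chrom, start, end = line.split('\t')[:3]
            yield chrom, int(start), int(end)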