mroosmalen / nanosv

SV caller for nanopore data
MIT License

NanoSV takes too long parsing BAM file in plant genome #57

Closed Biometeor closed 5 years ago

Biometeor commented 5 years ago

Dear NanoSV developers, I am running NanoSV to identify SVs in a plant genome. The contig assembly statistics are as follows:

| Metric | Value |
| --- | --- |
| Total_length | 908578321 |
| Total_number | 714921 |
| Num>=100 | 714541 |
| Num>=2000 | 32345 |
| Average_length | 1270 |
| Max_length | 386731 |
| Min_length | 2 |
| N50_length | 36867 |
| N50_number | 6156 |
| N60_length | 24550 |
| N60_number | 9174 |
| N70_length | 13032 |
| N70_number | 14164 |
| N80_length | 2504 |
| N80_number | 29048 |
| N90_length | 461 |
| N90_number | 128499 |

I aligned the reads with LAST or minimap, which produced a BAM file of about 20 GB, and I did not provide a BED file. The command:

python NanoSV.py --sambamba sambamba --config config.ini tmp.sorted.bam -o tmp.vcf

NanoSV has been stuck parsing the BAM file for 12 days and is still running. Thank you very much.

Biometeor commented 5 years ago

I have tried running NanoSV to identify bird SVs in the same way, and it runs successfully and quickly.

mroosmalen commented 5 years ago

Did you use the same config.ini file for both of them? What does your config.ini look like? Did you set depth_support to False? The "default" BED file is only compatible with the human reference genome, and it will be used by default (depth_support=True).
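For reference, the setting in question lives in the config.ini passed via --config. The excerpt below is an illustrative sketch; the section header and surrounding options may differ between NanoSV versions, so check the config.ini shipped with your install:

```ini
; Illustrative config.ini excerpt (section name may vary by NanoSV version).
[Detection options]
; Disable coverage-based depth support: the bundled "default" BED file
; targets the human reference genome and is unsuitable for other species.
depth_support = False
```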

Biometeor commented 5 years ago

Yes! I used the same config.ini file for both of them. I am sure that I set depth_support to False, and that (depth_support) is the only parameter I changed in the default config.ini.

Biometeor commented 5 years ago

Hi, I just ran NanoSV to identify bird SVs with 800 reads. It runs quickly on the bird genome, but it is still running on the plant genome, so I wonder whether the task is so slow because there are too many scaffolds in the plant genome; it has 685354 scaffolds. Here are the files I tested on the plant genome: file.tar.gz

mroosmalen commented 5 years ago

I did some debugging on your data, and it looks like the problem is indeed the large number of scaffolds. It took about 20 seconds to process each scaffold, which would take ~160 days if the scaffolds are processed one by one. You can try to reduce this by giving it more threads (default 4) on the command line: -t 4
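For example, the original command with the thread count raised (16 here is an arbitrary illustrative value; set it to whatever your machine can spare):

```
# Same invocation as above, but with 16 worker threads instead of the default 4.
python NanoSV.py -t 16 --sambamba sambamba --config config.ini tmp.sorted.bam -o tmp.vcf
```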

Biometeor commented 5 years ago

NanoSV has been stuck parsing the test BAM file (15 MB) for 20 h with -t 100. The task reports a maximum memory of 104.7 and a CPU of 29.2. I can't imagine how much time and memory would be spent on a 20 GB BAM file.

mroosmalen commented 5 years ago

You can also split your BAM file per scaffold, if you are not interested in inter-scaffold variants (translocations).
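A minimal sketch of the per-scaffold approach, assuming `samtools` is installed and the reference has a faidx index (`ref.fa.fai`); the file names are illustrative. The script only builds one `samtools view` command string per scaffold, so each sub-BAM can then be run through NanoSV independently:

```python
# Sketch: generate one "samtools view" command per scaffold listed in a
# faidx (.fai) index, whose first tab-separated column is the scaffold name.
def split_commands(fai_lines, bam="tmp.sorted.bam"):
    cmds = []
    for line in fai_lines:
        scaffold = line.split("\t")[0]
        # -b: BAM output; restricting to one scaffold drops
        # inter-scaffold (translocation) evidence, as noted above.
        cmds.append(f"samtools view -b {bam} {scaffold} -o {scaffold}.bam")
    return cmds

if __name__ == "__main__":
    # Hypothetical .fai lines: name, length, offset, linebases, linewidth.
    fai = ["scaffold_1\t386731\t11\t60\t61", "scaffold_2\t36867\t393200\t60\t61"]
    for cmd in split_commands(fai):
        print(cmd)
```

With ~685k scaffolds you would likely batch the resulting sub-BAMs (e.g. many scaffolds per job) rather than launch one NanoSV run per file.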

Biometeor commented 5 years ago

Maybe that's the only way. Thank you very much!