mroosmalen / nanosv

SV caller for nanopore data
MIT License

NanoSV takes too long parsing BAM file in plant genome #57

Closed Biometeor closed 5 years ago

Biometeor commented 5 years ago

Dear NanoSV developers, I am running NanoSV to identify SVs in a plant genome. The contig assembly statistics are as follows:

| Metric | Value |
| --- | --- |
| Total_length | 908578321 |
| Total_number | 714921 |
| Num>=100 | 714541 |
| Num>=2000 | 32345 |
| Average_length | 1270 |
| Max_length | 386731 |
| Min_length | 2 |
| N50_length | 36867 |
| N50_number | 6156 |
| N60_length | 24550 |
| N60_number | 9174 |
| N70_length | 13032 |
| N70_number | 14164 |
| N80_length | 2504 |
| N80_number | 29048 |
| N90_length | 461 |
| N90_number | 128499 |

I aligned the reads with LAST or minimap, which produced a BAM file of about 20 GB, and I did not provide a BED file. The command:

python NanoSV.py --sambamba sambamba --config config.ini tmp.sorted.bam -o tmp.vcf

NanoSV has been stuck parsing the BAM file for 12 days and is still running. Thank you very much.

Biometeor commented 5 years ago

I have tried running NanoSV to identify bird SVs in the same way, and it runs successfully and quickly.

mroosmalen commented 5 years ago

Did you use the same config.ini file for both of them? What does your config.ini look like? Did you set depth_support to False? The "default" BED file is only compatible with the human reference genome, and it will be used by default (depth_support=True).
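For reference, the setting in question lives in the config.ini passed via --config. The excerpt below is an illustrative sketch; the section header and surrounding options may differ between NanoSV versions, so check the config.ini shipped with your install:

```ini
; Illustrative config.ini excerpt (section name may vary by NanoSV version).
[Detection options]
; Disable coverage-based depth support: the bundled "default" BED file
; targets the human reference genome and is unsuitable for other species.
depth_support = False
```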

Biometeor commented 5 years ago

Yes! I used the same config.ini file for both of them. I am sure that I set depth_support to False, and that (depth_support) is the only parameter I changed in the default config.ini.

Biometeor commented 5 years ago

Hi, I just ran NanoSV to identify bird SVs with 800 reads. It runs quickly on the bird genome, but it is still running on the plant genome, so I wonder whether the task is so slow because there are too many scaffolds in the plant genome; it has 685354 scaffolds. Here are the files I tested on the plant genome: file.tar.gz

mroosmalen commented 5 years ago

I did some debugging on your data, and it looks like the problem is indeed the large number of scaffolds. It took about 20 seconds to process each scaffold, which would take ~160 days if the scaffolds are processed one by one. You can try to reduce this by giving it more threads (default 4) on the command line: -t 4
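For example, the original command with the thread count raised (16 here is an arbitrary illustrative value; set it to whatever your machine can spare):

```
# Same invocation as above, but with 16 worker threads instead of the default 4.
python NanoSV.py -t 16 --sambamba sambamba --config config.ini tmp.sorted.bam -o tmp.vcf
```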

Biometeor commented 5 years ago

NanoSV has been stuck parsing the test BAM file (15 MB) for 20 h with -t 100. The task reports a maximum memory of 104.7 and a CPU of 29.2. I can't imagine how much time and memory would be spent on a 20 GB BAM file.

mroosmalen commented 5 years ago

You can also split your BAM file per scaffold, if you are not interested in inter-scaffold variants (translocations).
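A minimal sketch of the per-scaffold approach, assuming `samtools` is installed and the reference has a faidx index (`ref.fa.fai`); the file names are illustrative. The script only builds one `samtools view` command string per scaffold, so each sub-BAM can then be run through NanoSV independently:

```python
# Sketch: generate one "samtools view" command per scaffold listed in a
# faidx (.fai) index, whose first tab-separated column is the scaffold name.
def split_commands(fai_lines, bam="tmp.sorted.bam"):
    cmds = []
    for line in fai_lines:
        scaffold = line.split("\t")[0]
        # -b: BAM output; restricting to one scaffold drops
        # inter-scaffold (translocation) evidence, as noted above.
        cmds.append(f"samtools view -b {bam} {scaffold} -o {scaffold}.bam")
    return cmds

if __name__ == "__main__":
    # Hypothetical .fai lines: name, length, offset, linebases, linewidth.
    fai = ["scaffold_1\t386731\t11\t60\t61", "scaffold_2\t36867\t393200\t60\t61"]
    for cmd in split_commands(fai):
        print(cmd)
```

With ~685k scaffolds you would likely batch the resulting sub-BAMs (e.g. many scaffolds per job) rather than launch one NanoSV run per file.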

Biometeor commented 5 years ago

Maybe that's the only way. Thank you very much!