mbhall88 / tbpore

Mycobacterium tuberculosis genomic analysis from Nanopore sequencing data
MIT License

Test the impact on clustering of changing `minimap2` index to use less RAM #31

Open leoisl opened 2 years ago

leoisl commented 2 years ago

The current minimap2 index was built with -I 12G to match the H2H index. This pushes the RAM usage of tbpore process to 13.1GB. We could instead build the index with -I 500M, which would bring the RAM usage of tbpore process down to ~5GB and make it much more practical to run on a personal laptop, but the results would then no longer be identical to the H2H results. We should evaluate the impact of this different index on the clustering and on the tbpore results in general, and decide whether it is indeed OK to switch to this lighter index. This might be related to https://github.com/mbhall88/tbpore/issues/22
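One way to make "impact on the clustering" concrete is to compare which sample pairs end up in the same cluster under each index. The following is only a minimal sketch: it assumes the clusters from each run are already available as collections of sample names, and the function names `co_clustered_pairs` and `compare_clusterings` are hypothetical, not part of tbpore.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered sample pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        # Sort so each pair has one canonical orientation.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def compare_clusterings(baseline, candidate):
    """Report pairs linked only in the baseline (lost) or only in the candidate (gained)."""
    base = co_clustered_pairs(baseline)
    cand = co_clustered_pairs(candidate)
    return {"lost": base - cand, "gained": cand - base}


# Made-up sample names, for illustration only.
baseline = [{"s1", "s2", "s3"}, {"s4"}]   # e.g. clusters obtained with the -I 12G index
candidate = [{"s1", "s2"}, {"s3", "s4"}]  # e.g. clusters obtained with the -I 500M index
diff = compare_clusterings(baseline, candidate)
```

If both `lost` and `gained` are empty across the test datasets, the two indexes give the same clustering even if the underlying results are not byte-identical.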

leoisl commented 1 year ago

We might be able to keep -I 12G and still use less RAM. The trick would be to use this minimap2 param:

       --idx-no-seq
                 Don't  store  target  sequences  in  the index. It saves disk
                 space and memory but the index  generated  with  this  option
                 will  not  work  with -a or -c.  When base-level alignment is
                 not requested, this option is automatically applied.

... although when we map reads to the decontamination minimap2 index, we do require base-level mapping (i.e. we run with flags -aL). But looking downstream, I think we don't need these flags and can parse a PAF file instead. It all depends on whether we actually need to decrease RAM usage or not. @FlorianePoint could you please tell us if you have observed any RAM issues when running tbpore, either at your site or in Madagascar? Thanks!

FlorianePoint commented 1 year ago

Hi Leandro, Yes, we (Nanah in Mada and I) have already had RAM issues when using tbpore, with minimap2 failing with return code -9. It happened when we had less than 13G free. Floriane
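For context on the -9: when a child process is launched through Python's `subprocess` module on a POSIX system, a negative return code means the child was terminated by that signal number, so -9 corresponds to SIGKILL, which is what the Linux out-of-memory killer sends. The sketch below only decodes such a code; it assumes the code was reported in the `subprocess` convention, and `describe_return_code` is a hypothetical helper, not a tbpore function.

```python
import signal


def describe_return_code(returncode):
    """Turn a subprocess-style return code into a readable message (POSIX only)."""
    if returncode < 0:
        # Negative values mean "terminated by signal number -returncode".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


message = describe_return_code(-9)
```

A SIGKILL during mapping with less than 13G free is consistent with the process being killed for running out of memory, although the signal alone does not prove it.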

mbhall88 commented 1 year ago

> But looking downstream I think we don't need these flags and can parse a PAF file.

Correct. I used to extract the reads from the SAM, but I have since switched to using seqkit grep to get the read ids from fastqs. So PAF will be fine, I think.
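As a rough illustration of what parsing a PAF file for read ids could look like: PAF is tab-separated with 12 mandatory columns, the first being the query (read) name, the sixth the target name, and the twelfth the mapping quality. This is a sketch under those assumptions only. The function name `mapped_read_ids` and the `min_mapq` parameter are hypothetical, and how tbpore actually decides which reads to keep is not shown here.

```python
def mapped_read_ids(paf_lines, min_mapq=0):
    """Collect the names of reads with at least one PAF record at or above min_mapq."""
    ids = set()
    for line in paf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected at least 12 PAF columns, got {len(fields)}")
        # Defensive: skip records with no target, in case unmapped reads are present.
        if fields[5] == "*":
            continue
        if int(fields[11]) >= min_mapq:
            ids.add(fields[0])
    return ids


# Made-up records, for illustration only.
example = [
    "read1\t1000\t0\t1000\t+\tcontigA\t5000\t100\t1100\t950\t1000\t60\n",
    "read2\t800\t10\t790\t-\tcontigB\t4000\t0\t780\t700\t780\t5\n",
    "read1\t1000\t0\t400\t+\tcontigB\t4000\t50\t450\t380\t400\t0\n",
]
ids = mapped_read_ids(example, min_mapq=10)
```

The resulting ids could then be written one per line to a file and passed to seqkit grep to pull the matching reads from the fastq.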