allow to start from sam files?

ohlab / GRiD

Growth Rate Index (GRiD) measures bacterial growth rate from reference genomes (including draft quality genomes) and metagenomic bins at ultra-low sequencing coverage (> 0.2x).


allow to start from sam files? #6

Closed housw closed 5 years ago

housw commented 5 years ago

Hi,

this is a feature request rather than a bug report. Would it be easy for you to add an option to start from aligned SAM or BAM files? It would make sense, since we usually generate these files for other purposes as well.

Please advise, thanks in advance!

Best, Shengwei

ohlab commented 5 years ago

Hi Shengwei (@housw), yes, that is doable, especially for the "single" mode.

However, for the "multiplex" mode, the SAM/BAM files supplied would need to be derived from reads mapped to the GRiD database or to a custom-generated GRiD database, since downstream steps rely on the database format.

Regards,

Tunde
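
The multiplex constraint described above (the SAM file must come from reads mapped to the GRiD database) can be checked up front. The following is only a minimal sketch, not GRiD's own code: it assumes the database is available as a plain FASTA file, and the function names are hypothetical. It compares the reference names declared in the SAM `@SQ` header lines with the FASTA sequence names, using the first whitespace-delimited token of each FASTA header because that is how bowtie2 names references in its SAM output.

```python
def sam_reference_names(sam_path):
    """Return the reference names declared in the @SQ header lines of a SAM file."""
    names = set()
    with open(sam_path) as handle:
        for line in handle:
            if not line.startswith("@"):
                break  # the header ends at the first alignment record
            if line.startswith("@SQ"):
                for field in line.rstrip("\n").split("\t")[1:]:
                    if field.startswith("SN:"):
                        names.add(field[3:])
    return names


def fasta_sequence_names(fasta_path):
    """Return sequence names from a FASTA file (first token of each header)."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:
                    names.add(parts[0])
    return names


def references_missing_from_database(sam_path, fasta_path):
    """Return SAM reference names that do not occur in the database FASTA.

    Assumption: the database is a single FASTA file. An empty result means
    every reference in the SAM header is also present in the database.
    """
    return sam_reference_names(sam_path) - fasta_sequence_names(fasta_path)
```

A non-empty result suggests the SAM file was produced against a different reference set and would not be suitable for a database-dependent workflow.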

housw commented 5 years ago

Hi Tunde,

yes, I agree. It would be highly appreciated if you could implement this feature in the 'single' mode.

Thanks a lot.

Cheers, Shengwei

ohlab commented 5 years ago

OK, great. I'll do this in the coming days and release a new version.

Tunde

housw commented 5 years ago

That's cool, looking forward to the next release! :+1:

nigiord commented 5 years ago

Hi there,

I agree this would be very useful for scaling GRiD to large analyses (lots of samples), since fastq files can be huge and are a pain to archive or to regenerate from archived files.

If starting from SAM/BAM is too much of a hassle to implement right now, something simpler that would definitely help would be to allow zipped files. I've looked at the code and it seems to me this would be possible without much change, since the *.fastq extension is only used in the bowtie2 calls here: https://github.com/ohlab/GRiD/blob/master/grid#L349-L356 https://github.com/ohlab/GRiD/blob/master/grid#L501-L513 and bowtie2 accepts fastq.gz files directly.

A solution would thus be to simply add an option: -r_ext Extension of the files containing reads (fastq, fq, fastq.gz, etc.)

Cheers, Nils
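
The extension option proposed above could work roughly as follows. This is a minimal sketch under stated assumptions, not the change that was later made to GRiD: the function names and the one-directory layout are hypothetical, and only the `r_ext` idea comes from the comment. The point is that the extension is matched as a whole suffix, so multi-part extensions such as `fastq.gz` select the right files and strip cleanly from sample names.

```python
from pathlib import Path


def find_read_files(reads_dir, r_ext="fastq"):
    """List files in reads_dir whose names end with the given extension.

    r_ext may be given with or without a leading dot and may contain
    several parts, for example "fastq.gz".
    """
    suffix = "." + r_ext.lstrip(".")
    return sorted(
        p for p in Path(reads_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


def sample_name(path, r_ext="fastq"):
    """Derive a sample name by removing the full extension from the file name."""
    suffix = "." + r_ext.lstrip(".")
    name = Path(path).name
    return name[: -len(suffix)] if name.endswith(suffix) else name
```

Each selected path could then be passed to the aligner unchanged, while `sample_name` gives a stable prefix for output files.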

aemiol commented 5 years ago

Hi Nils, totally agree. It will be easy to implement input choices of zipped fastq or SAM/BAM. I haven't been able to implement this due to other projects but hope to get to it during the weekend.

Thanks! Tunde

aemiol commented 5 years ago

@housw I just released a new version that accepts SAM files as input. Thanks to @nigiord who also expanded support for different input file extensions.

Cheers, Tunde

housw commented 5 years ago

Hi Tunde,

that's awesome, thanks a lot for your hard work over the weekend. I'm going to test it with my data set and will keep you updated.

Cheers, Shengwei

nigiord commented 5 years ago

Thank you indeed! That's gonna save quite some time for analyses with lots of samples.

However, SAM inputs are only valid in the 'single' module.

Any technical limitations that impede the use of SAM inputs for the multiplex module? Is it also a problem for GRiD if the SAM inputs have been generated using paired-end reads?

Cheers, Nils

In fact, I've been trying GRiD 1.2 on a subset of my data over the last few weeks, and I have a couple of technical questions and suggestions. I'll probably ask them elsewhere since this thread is focused on SAM inputs. Would you prefer a single issue containing all my points, or one issue per point?

aemiol commented 5 years ago

@nigiord It's fine if the SAM inputs are generated from paired-end reads. I avoided SAM inputs for the multiplex module since they have to be generated from reads mapped to the GRiD database. Sure, you can open a single thread regarding your other suggestions.

housw commented 5 years ago

Hi Tunde,

GRiD works great with my sam files, thanks a lot!

Cheers, Shengwei

aemiol commented 5 years ago

That is good to know. Cheers

Tunde