liu-congcong / MetaDecoder

An algorithm for clustering metagenomic sequences.
GNU General Public License v3.0

Adding BAM usage for MetaDecoder coverage calculation? #10

Open avw-adifranco opened 1 year ago

avw-adifranco commented 1 year ago

Hello,

I was wondering whether you plan to add an option to use BAM files instead of SAM files for the coverage calculation? Or is there a way to retrieve the SAM content from STDOUT?

I'm asking because this would help avoid storage issues on large batches.

Thank you.

liu-congcong commented 1 year ago

Hi avw-adifranco,

Thank you for using MetaDecoder. As you mentioned, reading BAM files directly would require installing additional packages, such as pysam. However, I prefer to keep MetaDecoder as lightweight as possible, and samtools is already widely used.

For MetaDecoder, all SAM files need to be prepared in advance, which does take up a lot of space. I have updated MetaDecoder to version 1.0.17, and you now have a more convenient way to handle clustering of a large number of sequencing samples.

# update to the latest version #
pip3 install -U https://github.com/liu-congcong/MetaDecoder/releases/download/v1.0.17/metadecoder-1.0.17-py3-none-any.whl

# calculate coverage for each sample #
for file in *.bam
do
    samtools view -h -o "${file}.sam" "${file}"
    metadecoder coverage -s "${file}.sam" -o "2023-04-19.${file}.metadecoder.coverage"
    rm "${file}.sam" # free space #
done

# load all coverage files for clustering #
metadecoder cluster -c 2023-04-19.*.metadecoder.coverage -f assembly.gz -s seed -o MetaDecoder.cluster
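If one of the two commands fails partway through a large batch, the loop above leaves the intermediate SAM file on disk. A minimal sketch of a wrapper that always removes it is shown below. The function name coverage_from_bam is hypothetical and not part of MetaDecoder; it uses only the samtools and metadecoder options already shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MetaDecoder): convert one BAM file to a
# temporary SAM file, compute coverage, and delete the SAM file whether or
# not the commands succeed.
coverage_from_bam() {
    local bam=$1 out=$2
    local sam="${bam}.sam" status=0

    # Run the conversion, then the coverage step; remember the first failure.
    samtools view -h -o "$sam" "$bam" \
        && metadecoder coverage -s "$sam" -o "$out" \
        || status=$?

    rm -f "$sam" # free space, even after a failure #
    return "$status"
}

# Example use (the output prefix is only an illustration):
# for file in *.bam
# do
#     coverage_from_bam "$file" "2023-04-19.${file}.metadecoder.coverage" || break
# done
```

Only one SAM file exists at any time, so peak disk usage is roughly that of the largest sample rather than of the whole batch.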

Thank you for your suggestions.

Best,

Cong-Cong

avw-adifranco commented 1 year ago

Hi Cong-Cong,

Thanks for the reply.

I've always used version 1.0.17, so I was not aware of this change. I understand that you do not want to add any dependencies to your software.

However, I believe the only difference would be how you call open in your read_sam functions: using rb instead of r. Implementing the extra parameter takes some work, but it would save a lot of I/O time because there would be no need to write a SAM file to disk.

Best, Arnaud

avw-adifranco commented 1 year ago

Hi again,

Sorry, I went too fast without checking, and my last suggestion does not work. I'll look into the STDOUT pipe option on my side when I have more time.

Best,

liu-congcong commented 1 year ago

Hi Arnaud,

I have provided a program for merging all sample coverage files into a single file, so you now have another way to do this:


# force reinstall the latest version #
pip3 install --force-reinstall https://github.com/liu-congcong/MetaDecoder/releases/download/v1.0.17/metadecoder-1.0.17-py3-none-any.whl

# obtain the tool for merging #
git clone https://github.com/liu-congcong/FileAligner

# calculate coverage for each sample #
for file in *.bam
do
    samtools view -h -o "${file}.sam" "${file}"
    metadecoder coverage -s "${file}.sam" -o "2023-04-19.${file}.metadecoder.coverage"
    rm "${file}.sam" # free space #
done

# merge all 2023-04-19.*.metadecoder.coverage into a single file #
FileAligner -t 1,2,3 -i 2023-04-19.*.metadecoder.coverage -o coverage

# load the single coverage file for clustering #
metadecoder cluster -c coverage -f assembly.gz -s seed -o MetaDecoder.cluster

Some additional information that may help you:

  1. It is not possible to read BAM files simply by calling Python's built-in "open" function in binary mode, because BAM is a compressed binary format with its own layout (see the Sequence Alignment/Map Format Specification).

  2. A simple program for reading BAM files is provided here: https://github.com/liu-congcong/BamReader
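
To illustrate point 1: the SAM/BAM specification describes BAM as BGZF-compressed data (a gzip-compatible block format) whose decompressed stream begins with the four bytes "BAM" followed by the byte 0x01. The sketch below checks for that signature using only gzip, head, od and tr. The function names bam_magic and is_bam are hypothetical and do not come from MetaDecoder or BamReader.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Print the first four bytes of the decompressed stream as hex.
# For a BAM file this is 42414d01 ("BAM" followed by 0x01).
# For a file that is not gzip-compatible (e.g. plain SAM) it prints nothing.
bam_magic() {
    gzip -dc "$1" 2>/dev/null | head -c 4 | od -An -tx1 | tr -d ' \n'
}

# Succeed only if the file looks like a BAM file.
is_bam() {
    [ "$(bam_magic "$1")" = "42414d01" ]
}

# Example use (sample.bam is a placeholder file name):
# if is_bam sample.bam; then echo "BAM"; else echo "not BAM"; fi
```

Everything after the signature (header text, reference list, alignment records) is packed binary data, which is why a dedicated reader is needed rather than a line-by-line text parser.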

Best,

Cong-Cong