Open avw-adifranco opened 1 year ago
Hi avw-adifranco,
Thank you for using MetaDecoder. As you mentioned, reading BAM files directly may require installing additional packages such as pysam. However, I prefer to keep MetaDecoder as light as possible, and samtools is already widely used.
For MetaDecoder, all SAM files need to be prepared in advance, which takes up a lot of space. I have updated MetaDecoder to version 1.0.17, and you now have a more convenient way to handle clustering of a large number of sequencing samples.
# update to the latest version #
pip3 install -U https://github.com/liu-congcong/MetaDecoder/releases/download/v1.0.17/metadecoder-1.0.17-py3-none-any.whl
# calculate coverage for each sample #
for file in *bam
do
    samtools view -h -o "${file}.sam" "${file}"
    metadecoder coverage -s "${file}.sam" -o "2023-04-19.${file}.metadecoder.coverage"
    rm "${file}.sam" # free space #
done
# load all coverage files for clustering #
metadecoder cluster -c 2023-04-19.*.metadecoder.coverage -f assembly.gz -s seed -o MetaDecoder.cluster
Thank you for your suggestions.
Best,
Cong-Cong
Hi Cong-Cong,
Thanks for the reply.
I have always used version 1.0.17, so I was not aware of this change. I understand that you do not want to add any dependencies to your software.
However, I believe the only difference here would be how you call "open" in your "read_sam" functions, using "rb" instead of "r". It is extra work to implement the parameters, but it would save a lot of time in I/O, as there would be no need to write a sam file to the disk.
Best, Arnaud
Hi again,
Sorry, I went too fast without checking, and my last suggestion does not work. I'll look into the STDOUT pipe option on my side when I have more time.
Best,
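For anyone exploring the STDOUT idea mentioned above: one thing that might be worth trying is a named pipe, so that samtools writes SAM text into the pipe while metadecoder reads from it and nothing is stored on disk. The sketch below is untested against MetaDecoder and assumes that "metadecoder coverage" reads its -s input exactly once, from start to end. If it seeks, reopens the file, or splits it by byte offsets, this will fail. The function name coverage_via_fifo is hypothetical.

```shell
# Sketch: stream SAM text through a named pipe so no .sam file lands on disk.
# ASSUMPTION: "metadecoder coverage" reads its -s input once, front to back.
coverage_via_fifo() {
    bam=$1
    out=$2
    tmpdir=$(mktemp -d) || return 1
    fifo="${tmpdir}/input.sam"
    mkfifo "${fifo}" || { rm -rf "${tmpdir}"; return 1; }

    # Writer: decode BAM to SAM (header included) into the pipe, in the background.
    samtools view -h "${bam}" > "${fifo}" &
    writer=$!

    # Reader: metadecoder is given the pipe in place of a SAM file.
    metadecoder coverage -s "${fifo}" -o "${out}"
    status=$?

    if [ "${status}" -ne 0 ]; then
        # The reader failed; stop the writer so it does not block on the pipe.
        kill "${writer}" 2>/dev/null
        wait "${writer}" 2>/dev/null
    else
        # Propagate a samtools failure if the reader itself succeeded.
        wait "${writer}" || status=$?
    fi

    rm -rf "${tmpdir}"
    return "${status}"
}
```

It would be called inside the loop as, for example, coverage_via_fifo "${file}" "2023-04-19.${file}.metadecoder.coverage". If the assumption does not hold, the temporary .sam approach remains the safe route.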
Hi Arnaud,
I have provided a program for merging all sample coverage files into a single file, so you now have another way to do this:
# force reinstall the latest version #
pip3 install --force-reinstall https://github.com/liu-congcong/MetaDecoder/releases/download/v1.0.17/metadecoder-1.0.17-py3-none-any.whl
# obtain the tool for merging #
git clone https://github.com/liu-congcong/FileAligner
# calculate coverage for each sample #
for file in *bam
do
    samtools view -h -o "${file}.sam" "${file}"
    metadecoder coverage -s "${file}.sam" -o "2023-04-19.${file}.metadecoder.coverage"
    rm "${file}.sam" # free space #
done
# merge all 2023-04-19.*.metadecoder.coverage into a single file #
FileAligner -t 1,2,3 -i 2023-04-19.*.metadecoder.coverage -o coverage
# load the single coverage file for clustering #
metadecoder cluster -c coverage -f assembly.gz -s seed -o MetaDecoder.cluster
Some additional information that may help you:
It is not possible to simply use Python's built-in "open" function to read BAM files in binary mode, because the BAM format has its own binary structure (see the Sequence Alignment/Map Format Specification).
A simple program for reading BAM files is provided here: https://github.com/liu-congcong/BamReader
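As a small illustration of why binary "open" alone is not enough: BAM data is stored in BGZF, a series of gzip-compatible compressed blocks, and only after decompression does the stream begin with the 4-byte magic "BAM\1". The following is a minimal shell sketch of that check, not part of MetaDecoder or BamReader; the function name is_bam is hypothetical.

```shell
# is_bam FILE: succeed if FILE decompresses (gzip/BGZF) to a stream that
# starts with the BAM magic bytes "BAM\1" (hex 42 41 4d 01).
is_bam() {
    magic=$(gzip -dc 2>/dev/null < "$1" | head -c 4 | od -An -tx1 | tr -d ' \n')
    [ "${magic}" = "42414d01" ]
}

# Example use (left as a comment so that sourcing this file runs nothing):
#   for file in *bam; do is_bam "${file}" || echo "not a BAM file: ${file}"; done
```

Reading the actual alignment records still requires parsing the binary layout described in the specification, which is what a dedicated reader does.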
Best,
Cong-Cong
Hello,
I was wondering if you were planning to add an option to use BAM files instead of SAM files for the coverage calculation? Or is there a way to retrieve the SAM content from STDOUT?
I'm asking in order to avoid storage issues on large batches.
Thank you.