liu-congcong / MetaDecoder

An algorithm for clustering metagenomic sequences.
GNU General Public License v3.0

contig names #7

Closed: hildebra closed this issue 1 year ago

hildebra commented 1 year ago

Hey, I was trying MetaDecoder after the nice Microbiome publication, but I got the following error:

    File "/hpc-home/hildebra/env/anac3/envs/MFF/bin/metadecoder", line 216, in <module>
      metadecoder_coverage.main(parameters)
    File "/hpc-home/hildebra/env/anac3/envs/MFF/lib/python3.8/site-packages/metadecoder/metadecoder_coverage.py", line 91, in main
      sequence2bin_coverages[lines[0]][int(lines[1]), coverage_index] += float(lines[2])
    KeyError: 'P1E7M3__C4_L=404='

Since my contigs are all named similarly, I was wondering whether this naming could be kept, as I need to trace the contig names through my experiments.

Further, I wanted to ask: could MetaDecoder also work with .bam or .cram files? This would certainly save some space on our cluster. Many thanks, Falk

liu-congcong commented 1 year ago

Thank you very much for using MetaDecoder. MetaDecoder currently supports only SAM-formatted files, because I don't like to import too many Python packages just to read BAM files. Contig names can be similar, but they must be unique and must conform to the FASTA file format specification.

The reason for this error may be that the headers of the SAM files are not all the same; MetaDecoder does not check the headers of the input SAM files. Please make sure that “P1E7M3__C4_L=404=” is present in the header section of every input SAM file. If it is, the assembly files used to generate these SAM files are identical.

Let me know if you have any other questions, but right now I'm going to take a nap, haha. I hope these hints are helpful to you.
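The header check described above can be sketched in a few lines of plain Python. This is only an illustration, not MetaDecoder code: the function names `sam_reference_names` and `files_missing_contig` are made up for this example, and it assumes uncompressed, tab-delimited SAM text in which contigs are declared on `@SQ` lines with an `SN:` field.

```python
def sam_reference_names(sam_path):
    """Return the set of contig names declared on @SQ header lines."""
    names = set()
    with open(sam_path) as handle:
        for line in handle:
            if not line.startswith("@"):
                break  # header lines come first; stop at the first alignment
            if line.startswith("@SQ"):
                for field in line.rstrip("\n").split("\t")[1:]:
                    if field.startswith("SN:"):
                        names.add(field[3:])
    return names


def files_missing_contig(sam_paths, contig):
    """List the SAM files whose header does not declare `contig`.

    A file with no header at all declares nothing, so it is listed too.
    """
    return [str(path) for path in sam_paths
            if contig not in sam_reference_names(path)]


# Hypothetical usage (the file names are placeholders):
# files_missing_contig(["1.sam", "2.sam"], "P1E7M3__C4_L=404=")
```

Any file returned by `files_missing_contig` is a candidate cause of the `KeyError`, since an alignment line in it would refer to a contig the header never declared.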

hildebra commented 1 year ago

Hey Liu, thanks for your quick reply, hope you got some good sleep ;) The fasta file was the same: it's a co-assembly, and each sam represents a mapping (bowtie2) of one of the samples the assembly was generated from to this one fasta of contigs. I have used exactly the same bams with metabat2 and semibin, and they completed without complaining. However, for metadecoder I re-extracted a .cram into a .sam using samtools.

About bam/sam: sure, I understand, having too many Python libs can get messy. On the other hand, an interface to htslib or similar could make your program much faster and let it deal with all kinds of sam/bam formatting differences that different aligners might produce, so in the long term it could also take some work off your hands. All the best, Falk

liu-congcong commented 1 year ago

Thank you very much for using MetaDecoder. Suppose you have two sam files, i.e., 1.sam and 2.sam. Please use samtools to extract their header sections:

    samtools view -H 1.sam | grep SQ > 1.sam.h
    samtools view -H 2.sam | grep SQ > 2.sam.h

To compare the two headers, please use md5 (macOS) or md5sum (Linux) as follows:

    md5sum *.sam.h

The resulting checksums should be identical. If these steps do not solve your problem, I would be very grateful if you could compress these headers and send them to me (congcong_liu@icloud.com). I may update the related code, if necessary.
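For reference, the same comparison can be sketched in Python without samtools, assuming the files are plain-text SAM. The helper names `sq_header_digest` and `headers_agree` are hypothetical and not part of MetaDecoder. Unlike `grep SQ`, which matches "SQ" anywhere on a header line, this sketch hashes only the lines that start with `@SQ`, so its digests are not expected to match the `md5sum` output byte for byte; what matters is whether the digests agree with each other.

```python
import hashlib


def sq_header_digest(sam_path):
    """Return the MD5 hex digest of the @SQ header lines of a SAM file."""
    digest = hashlib.md5()
    with open(sam_path, "rb") as handle:
        for line in handle:
            if not line.startswith(b"@"):
                break  # end of the header section
            if line.startswith(b"@SQ"):
                digest.update(line)
    return digest.hexdigest()


def headers_agree(sam_paths):
    """True if every file has the same @SQ lines in the same order."""
    return len({sq_header_digest(path) for path in sam_paths}) <= 1


# Hypothetical usage (the file names are placeholders):
# headers_agree(["1.sam", "2.sam"])
```

Because the lines are hashed in file order, two headers that list the same contigs in a different order count as different, which matches the behavior of the `md5sum` comparison.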

hildebra commented 1 year ago

Hey Liu, I hadn't added the "-h" flag to samtools (so the header was missing from the .sam). Problem solved: metadecoder processed my data very fast. Thanks for your quick responses on this! Falk