liu-congcong / MetaDecoder

An algorithm for clustering metagenomic sequences.
GNU General Public License v3.0
30 stars 2 forks source link

IndexError during cluster step #4

Closed ohickl closed 1 year ago

ohickl commented 2 years ago

Hi, I am trying to run MetaDecoder on some of the CAMI 2 toy samples, but for most of them get an error during the clustering step:

Creating sam files.
+ parallel -j 16 'samtools view -@ 1 -h {} -o .../sam_files/{/.}.sam' ::: .../mapping_sorted.bam
+ SAM_FILES=(${OUT_DIR}/sam_files/*.sam)
+ metadecoder coverage -s .../sam_files/mapping_sorted.sam -o .../METADECODER.COVERAGE
2022-07-08 14:04:35 -> Loading sam files.
2022-07-08 14:04:48 -> Done.
2022-07-08 14:04:48 -> Writing to file.
2022-07-08 14:04:49 -> Finished.
+ rm -rf .../sam_files
+ metadecoder seed -f .../assembly.fasta --threads 16 -o .../METADECODER.SEED
2022-07-08 14:04:52 -> Identifying protein sequences.
2022-07-08 14:05:46 -> Done.
2022-07-08 14:05:46 -> Mapping marker genes to protein sequences.
2022-07-08 14:05:54 -> Done.
2022-07-08 14:05:54 -> Writing to file.
2022-07-08 14:05:54 -> Finished.
+ metadecoder cluster -f .../assembly.fasta -c .../METADECODER.COVERAGE -s .../METADECODER.SEED -o .../METADECODER
2022-07-08 14:05:55 -> Loading fasta file.
2022-07-08 14:05:55 -> Done.
2022-07-08 14:05:55 -> Loading coverage file.
2022-07-08 14:05:55 -> Done.
2022-07-08 14:05:55 -> Counting kmers of all sequences.
2022-07-08 14:05:56 -> Done.
2022-07-08 14:05:56 -> Loading seeds file.
2022-07-08 14:05:56 -> Done.
2022-07-08 14:05:56 -> Running the DPGMM algorithm to obtain clusters.
2022-07-08 14:05:56 -> Done.
length sequences_: 794
c_longlong, total_sequences: <class 'ctypes.c_long'>, 3012
length container: 3012
sequence 3027 from sequences_ not in container, container_value: -3013
Traceback (most recent call last):
  File ".../metadecoder_cluster.py", line 354, in main
    container[sequence] = container_value
IndexError: invalid index

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../metadecoder", line 246, in <module>
    metadecoder_cluster.main(parameters)
  File ".../metadecoder_cluster.py", line 357, in main
    raise IndexError
IndexError

I am unsure what to make of this, as it runs fine for a few of samples. Let me know, if you require the COVERAGE or SEED file, or any others.

Best

Oskar

liu-congcong commented 2 years ago

Thank you for using MetaDecoder. Which CAMI dataset causes this error? Please provide me the URL of the assembly and send the coverage file to congcong_liu@icloud.com, thank you very much.

In addition, these tips may help you. E.g. metadecoder cluster -f ASSEMBLY -c COVERAGE -s SEED -o OUTPUT The COVERAGE file is created using metadecoder coverage. The inputs are a series of sam files, please make sure that all sam files have the same @SQ lines that corresponds to the ASSEMBLY file.

ohickl commented 2 years ago

Thanks for taking a look at this! I sent you the coverage file for a sample as well as a link to the assembly.

liu-congcong commented 2 years ago

I run MetaDecoder on this dataset without any error.

STEP 1. java -jar camiClient.jar -d https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Urogenital_tract . -p short_read/2017.12.04_18.56.22_sample_2/contigs/anonymous_gsa.fasta.gz mv anonymous_gsa.fasta.gz METADECODER.ASSEMBLY

STEP 2. metadecoder seed -f METADECODER.ASSEMBLY -o METADECODER.SEED md5sum METADECODER.SEED (dbec6c77e574d9e885e49c1fbb11b485)

STEP 3. metadecoder cluster -f METADECODER.ASSEMBLY -s METADECODER.SEED -c METADECODER.COVERAGE -o METADECODER md5sum METADECODER.COVERAGE (4a092a997d8c5d17bd6828fd58e06302)

Got 45 clusters.

liu-congcong commented 2 years ago

Python 3.8.10 and MetaDecoder 1.0.12. Step 3 was tested on both Linux 3.10.0-862.el7.x86_64 and MacOS 12.4.

liu-congcong commented 2 years ago

I guess I know why.

2022-07-08 14:05:56 -> Running the DPGMM algorithm to obtain clusters. 2022-07-08 14:05:56 -> Done.

This means that you cached the dpgmm results, which are prefixed with the assembly file name. Usually the assembly file name is unique, so different dpgmm files are generated.

If you successively set different assembly files with the same name, it will cause the latter one to use the former's dpgmm cache.

Therefore, you may need to remove the dpgmm cache before running the metadecoder cluster as follows: rm *.dpgmm metadecoder cluster .......

I hope this will solve your problem.

ohickl commented 2 years ago

Ah, makes sense. I was calling multiple jobs for different samples with identical assembly names from the same location. Changing the working dir to that of the respective output dir fixed it. Thanks! Might be useful to have a tmp_dir/work_dir flag for MetaDecoder, to allow the user to set it and avoid potential confusion?