metaGmetapop / metapop

A pipeline for the macro- and micro-diversity analyses and visualization of metagenomic-derived populations
MIT License

ValueError: invalid literal for int() with base 10 #11

Open liupfskygre opened 2 years ago

liupfskygre commented 2 years ago

Hi Ann, I got an error when running metapop (installed from pip) with the following command: metapop --input_samples ./bamfile --reference ./reference --norm tp-notp-166-metapop_ctfile.txt --threads 60

The installation should be fine, since I ran the toy dataset and it completed successfully.

The error output is below. Do you have any suggestions on how to fix it?

Thanks. Pengfei

error info

  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_mine_reads.py", line 450, in do_mine_reads
    res = access_read_ranges(selections_to_read, threads, output_directory)
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_mine_reads.py", line 202, in access_read_ranges
    res = pool.map(read_one_range, ranges)
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: invalid literal for int() with base 10: 'KQGRI2_20_08_k141_904564'

liupfskygre commented 2 years ago

I reran the command and got the following errors, similar to the ones above:

Reference base at each position will be the consensus of all files.
Getting codon usage bias...
Finalizing SNPs...
Updating genes with consensus bases...
Updating genomes with consensus bases...
MetaPop SNP refinement finished at: 05/02/2022 11:40:48
Linking SNPs starting at: 05/02/2022 11:40:48...multiprocessing.pool.RemoteTraceback:

Traceback (most recent call last):
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_mine_reads.py", line 143, in read_one_range
    leftmost = int(segs[3].decode())
ValueError: invalid literal for int() with base 10: 'TGLS2_1908_Scaff092085'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/PTPE2/Software/miniconda3/envs/metapop/bin/metapop", line 8, in <module>
    sys.exit(main())
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_main.py", line 300, in main
    linked_file = metapop.metapop_mine_reads.do_mine_reads(output_directory_base, threads)
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_mine_reads.py", line 450, in do_mine_reads
    res = access_read_ranges(selections_to_read, threads, output_directory)
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/site-packages/metapop/metapop_mine_reads.py", line 202, in access_read_ranges
    res = pool.map(read_one_range, ranges)
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/PTPE2/Software/miniconda3/envs/metapop/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: invalid literal for int() with base 10: 'TGLS2_1908_Scaff092085'

liupfskygre commented 2 years ago

Hi Ann, I looked into this in more detail. In metapop_mine_reads.py, I see that segs[3] is assigned as ref_base = segs[3]:

    fh = open(file)
    for line in fh:
        if line.endswith("True\n"):
            segs = line.strip().split("\t")
            #contig_pos = segs[0]
            contig = segs[1]
            pos = int(segs[2])
            ref_base = segs[3]
            source = segs[9]
            snps = segs[10]
            contig_gene = segs[11]
            #if OC == 1, strand = forward, else strand = reverse
            OC = int(segs[14])
            codon = int(segs[15])
            pos_in_codon = int(segs[16])

            linked_data[source][contig][contig_gene][codon][OC].append([pos, ref_base, snps, pos_in_codon])

    fh.close()

I guess the file here refers to genic_snps.tsv in the MetaPop/07.Cleaned_SNPs directory, which has the following header, right?

contig_pos  contig  pos ref_base    depth   a_ct    t_ct    c_ct    g_ct    source  snps    contig_gene start   end OC  codon   pos_in_codon    link

If so, then ref_base = segs[3] should be a single base ('A', 'T', 'C', or 'G'), right?

In my case, it is something else.

And even with a single base, int(segs[3]) would raise an error, e.g. int('T').

So, which file does this code refer to, and how can this be fixed?

Thanks, Pengfei

metaGmetapop commented 2 years ago

Hi Pengfei - let me pass these errors on to Kenji. He's the mastermind behind the new code. We'll get back to you soon!

KGerhardt commented 2 years ago

That line caused the same error for another user. The problem was that the mapping tool he had used, BBMap, took more information from the deflines of his reads than just the sequence ID, and the additional information contained whitespace.

The split to create segs in the mine_reads script is done by issuing a call to samtools, reading the output into python, and splitting the line on whitespace. If there are more whitespaces than expected, then the position of the read in the reference genome is shifted past the 4th position in the split line.
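To make the shift concrete, here is a minimal sketch of the failure mode, assuming a SAM-style record whose read name kept a space-separated piece of the defline. The record is made up for illustration (only the contig name is taken from the traceback above), and this is not MetaPop's actual code:

```python
# Hypothetical SAM record as bytes, roughly as it might arrive from a samtools pipe.
# SAM columns are tab-separated: QNAME, FLAG, RNAME, POS, ...
# Here the read name kept part of the defline, so QNAME contains a space.
record = (
    b"read_001 1:N:0:ATCACG\t0\tTGLS2_1908_Scaff092085\t1337\t42\t10M"
    b"\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF"
)

# Splitting on any whitespace breaks QNAME into two tokens,
# so every later field moves one slot to the right.
segs = record.split()
shifted_field = segs[3].decode()  # now the contig name, not the position

try:
    leftmost = int(shifted_field)
except ValueError as err:
    message = str(err)
    print(message)
```

With one extra whitespace-separated token in the read name, index 3 lands on the reference name instead of the position, which would explain why the error message shows a contig name.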

We have a new version of the code up already that fixes this problem. The split now happens on tabs (samtools output is tab-separated) instead of on any whitespace.
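As a rough illustration of a tab-only split (not the code that shipped in the new version), the hypothetical helper `parse_leftmost` below parses a made-up record so that a space inside the read name stays within the first field:

```python
def parse_leftmost(record: bytes) -> int:
    """Return the 1-based leftmost position (4th SAM column) of one record.

    Hypothetical helper for illustration only; not part of MetaPop.
    """
    # Split on tabs only: whitespace inside a field no longer creates new fields.
    segs = record.rstrip(b"\n").split(b"\t")
    if len(segs) < 11:
        # A SAM alignment line has 11 mandatory columns.
        raise ValueError(
            f"expected at least 11 tab-separated fields, got {len(segs)}"
        )
    return int(segs[3].decode())


# Made-up record whose read name contains a space.
record = (
    b"read_001 1:N:0:ATCACG\t0\tTGLS2_1908_Scaff092085\t1337\t42\t10M"
    b"\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n"
)
leftmost = parse_leftmost(record)
print(leftmost)
```

The field-count check is an extra assumption of this sketch: it turns a malformed line into a clearer error than a failed integer conversion further downstream.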