janka2012 / digIS

Pipeline detecting distant and putative novel insertion sequences in prokaryotic genomes
MIT License
9 stars 0 forks source link

error running test data #4

Open aleuUH opened 3 years ago

aleuUH commented 3 years ago

Hello Janka

I am very interested in trying out your program. Unfortunately i am not able to run the program on the test data:

(/home/leua/e/digiIS) leua@cl5n007:~/digIS> python digIS_search.py -i data/test_data/NC_002608.fasta -g data/test_data/NC_002608.gb -o digis_genbank ===== Processing of NC_002608.1 sequence ===== Seed search...

hmmsearch :: search profile(s) against a sequence database

HMMER 3.3.2 (Nov 2020); http://hmmer.org/

Copyright (C) 2020 Howard Hughes Medical Institute.

Freely distributed under the BSD open source license.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

query HMM file: /home/leua/digIS/data/models/hmm/hmm_all_subfams.hmm

target sequence database: digis_genbank/pep/NC_002608.1.pep

per-dom hits tabular output: digis_genbank/hmmer/NC_002608.1_hmmsearch.hmmer3

show alignments in output: no

domain reporting threshold: score >= 0

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query: IS1 [M=104] Fatal exception (source file p7_pipeline.c, line 697): Target sequence length > 100K, over comparison pipeline limit. (Did you mean to use nhmmer/nhmmscan?) An error occurred when calling ['hmmsearch', '--noali', '--domT', '0.0', '--domtblout', 'digis_genbank/hmmer/NC_002608.1_hmmsearch.hmmer3', '/home/leua/digIS/data/models/hmm/hmm_all_subfams.hmm', 'digis_genbank/pep/NC_002608.1.pep']. Command '['hmmsearch', '--noali', '--domT', '0.0', '--domtblout', 'digis_genbank/hmmer/NC_002608.1_hmmsearch.hmmer3', '/home/leua/digIS/data/models/hmm/hmm_all_subfams.hmm', 'digis_genbank/pep/NC_002608.1.pep']' died with <Signals.SIGABRT: 6>.

phmmer :: search a protein sequence against a protein database

HMMER 3.3.2 (Nov 2020); http://hmmer.org/

Copyright (C) 2020 Howard Hughes Medical Institute.

Freely distributed under the BSD open source license.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

query sequence file: /home/leua/digIS/data/models/fasta/outliers.fasta

target sequence database: digis_genbank/pep/NC_002608.1.pep

per-dom hits tabular output: digis_genbank/hmmer/NC_002608.1_phmmer.hmmer3

show alignments in output: no

domain reporting threshold: score >= 0

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query: ISNCY_ISAcar1.1 [L=149] Fatal exception (source file p7_pipeline.c, line 697): Target sequence length > 100K, over comparison pipeline limit. (Did you mean to use nhmmer/nhmmscan?) An error occurred when calling ['phmmer', '--noali', '--domT', '0.0', '--domtblout', 'digis_genbank/hmmer/NC_002608.1_phmmer.hmmer3', '/home/leua/digIS/data/models/fasta/outliers.fasta', 'digis_genbank/pep/NC_002608.1.pep']. Command '['phmmer', '--noali', '--domT', '0.0', '--domtblout', 'digis_genbank/hmmer/NC_002608.1_phmmer.hmmer3', '/home/leua/digIS/data/models/fasta/outliers.fasta', 'digis_genbank/pep/NC_002608.1.pep']' died with <Signals.SIGABRT: 6>. Parsing Hmmer outputs... Filtering hits with e-value higher than 0.001. Seed Merging... Number of records before merging: 0. Number of records after merging: 0. Seed Extension... Filtering hits by cut off thresholds. Filtering hits shorter or equal than 150 bp Classification with GenBank annotation... ===== Exporting outputs ===== Exporting records... Exporting summary statistics...

Do you know why that is?

Thanks, Andy

janka2012 commented 3 years ago

Hi @aleuUH,

Thank you for your interest in using the digIS tool. The issue you reported comes from hmmer (https://github.com/EddyRivasLab/hmmer/issues/205). It seems that the developers of hmmer added some extra checks into the latest version. I will try to look at it. In the meantime, please use older versions of hmmer (I tested with hmmer-3.1b2). I also merged the latest updates and created a new release so make sure you have the latest version.

aleuUH commented 3 years ago

thanks!

Seems to be working now

fmalmeida commented 2 years ago

Hello Janka,

I was having the same issue discussed above with a sample. Although it now is fixed for the test data when using the latest version, and this error does not occur any more with this sample, I am having a different error:

Parsing Hmmer outputs...
Traceback (most recent call last):
  File "/workspace/digIS-digISv1.2/digIS_search.py", line 26, in <module>
    digIS.run()
  File "/workspace/digIS-digISv1.2/src/search_tool/digISMultifasta.py", line 23, in run
    digIS_rec.run(search=search)
  File "/workspace/digIS-digISv1.2/src/search_tool/digIS.py", line 313, in run
    self.parse()
  File "/workspace/digIS-digISv1.2/src/search_tool/digIS.py", line 57, in parse
    self.parse_hmmer_output('model', self.hmmsearch_output)
  File "/workspace/digIS-digISv1.2/src/search_tool/digIS.py", line 70, in parse_hmmer_output
    self.recs.append(RecordDigIS.from_hmmer(hsp, sid, dna_range[0], dna_range[1], strand, self.genome.name,
  File "/workspace/digIS-digISv1.2/src/search_tool/RecordDigIS.py", line 31, in from_hmmer
    return cls(genome_name, chrom, genome_seq, seq_len, hsp.qid, sid, hsp.qstart, hsp.qend, start, end, strand, float(hsp.acc), float(hsp.dom_bitscore), float(hsp.dom_evalue))
  File "/workspace/digIS-digISv1.2/src/search_tool/RecordDigIS.py", line 13, in __init__
    super().__init__(genome_name, chrom, start, end, strand, genome_seq, genome_len)
  File "/workspace/digIS-digISv1.2/src/common/grange.py", line 13, in __init__
    raise ValueError("GRange: start or end position is out of the range.")
ValueError: GRange: start or end position is out of the range.

The command is: python3 digIS-digISv1.2/digIS_search.py -i strain_1.fna -g strain_1.gbk

I have tested with both the newest and the old 3.1b2 hmmer versions and with biopython versions from 1.74 to 1.79.

When using the version 1.0 with hmmer-3.1b2 it goes when until exporting the csv where it raises the following:

raise ValueError("Number of elements in header and in row to write is not same.")\
ValueError: Number of elements in header and in row to write is not same.

I have figured that this is raised by a hit with a peptide predicted in the very beginning of the contig, from region 1 to 186.

However Biopython is zero-based so the hmmer parser reports it as 0 to 185, which raises the error in line 13 of grange.py that does not accept start <= 0.

If changing start <= 0 to start < 0 it solves, however the sequence is reported as beggining from base 0 in the results, which I don't think is the golden standard.

Otherwise, we could add a line in the grange.py to add 1 base to the start and end variables, which could solve change the zero-based scheme to one-based start.

Any ideas on the best way to solve it?

janka2012 commented 2 years ago

Hi Felipe,

Thank you for reporting this is issue. Could you please share your input data with me, so I can test it and hopefully fix it on my side? Thank you.

fmalmeida commented 2 years ago

Hi,

Sure. I cannot share with you all the data, but I can share a small subset of the contigs in which the contig that raises the problem (contig76) is present.

Contigs:

>contig00074
CTTTAGGCGGAGCACCGACGGAAGGTGTGATTGCTGGCACGCTCTCCCCTTACATCAACA
CCGAAATCAAAAAAGCGACGGAAGGCAACGACACGGCGAATATTCTTGCCCACGGCGTAT
GGGGAGCGATGGAGGCGGCAAGCCAAGGCGGTAGTGCGTTAGGCGGTGCGGCGGCAGCAA
TGACGGGCGAAGTGGGAGCGAAGCTACTTGCCGAGCAACTTTACGGACAAAGCAACCCTG
AACATCTCTCTACCGAGCAGAAACAAACGCTGTCGG
>contig00075
CTTTAGGCGGAGCACCGACGGAAGGTGTGATTGCTGGCACGCTCTCCCCTTACATCAACA
CCGAAATCAAAAAAGCGACGGAAGGCAACGACACGGCGAATATTCTTGCCCACGGCGTAT
GGGGAGCAATGGAAGCGGCAAGCCAAGGAGGTAGTGCGTTAGGCGGTGCGGCGGCAGCAA
TGACGGGCGAAGTGGGAGCGAAGCTACTTGCCGAGCAACTTTACGGACAAAGCAACCCTG
AACATCTCTCTACCGAGCAGAAACAAACGCTGTCGG
>contig00076
GCACAATGCAAAGCTGAATCTTCGTTTTAGGATAGACTGTATTGATGGCTTCCGGGAAGC
CTTTTAAACCGTCTACACAGGCAATAAAAATGTCTTTCAAGCCTCGATTTTGAAGCTCTG
TCAGCACGTTCTCCCAGAACTTCGCATCTTCATTTTCAGCAATCCAAAGCCCCAATAACT
CTTCCCTTGGCATATAAGGCAATAATCTGCTCATCCATTCCTGTGATGCGGGTTTGGTTT
TTCTTGATAAGTTGCGGTTCAAAGGTGCCGTCAC
>contig00077
GTAAGCAAAATGAGCAGCAGGCGGTGATGGAAATTATCGACAGTTTTTGCTTAAAAAATG
CCGTTATAACGGTGGATGCGATGAATACACAGAAGAAAATTGCGGAGAAAATCATCGAAA
AGAAAGGAGATTATGTGATGCCGTTGAAGAAAAATCACCGTCAATTTCAGTCGGAAGTGG
AGGCGTATTTTCACAAAATCAGCCGAGATTGCCCCGAAATGCTTGAGACGTTTGAGGAGG
TAAATGCGGAACGTCGTCGAATTGATGAGCG
>contig00078
CGTTTGATACCGCTTGGGCAACACCTTGTAATTGCTCCGCCGCCCGTAACGCCTCAAAAC
CAACATTCGCTGCTGCCATTGCATTCACTCTGTCGTTTTTACTGCTACCCACCGATTTTG
CAGATTTTACCGTCGAATCGACCGCTTGTATCAGATAATTTGCAATTTCAGTATTGCTAT
ATTCGCTACAATGTTTATTCATAAGCCTCCTAATTACAGCCGGAGTAATTTCACCAATAT
TTTTCAAAAAATAAAATAATTACTTCG

Please, tell me if it is enough.

janka2012 commented 2 years ago

@fmalmeida thanks for sharing the contigs. It should be enough. I hope I will find time to look at it in the upcoming week or two.

fmalmeida commented 2 years ago

Hi @janka2012,

I was playing with digIS in the meantime and I've figured the problem seems to occur when a hmm hit occurs in the very beginning, let's say in base 1. However biopython is zero-based and translates it to 0, not 1.

When adding the few codes to the grange.py file:

[...] # more code
class Grange:
    def __init__(self, genome_name, chrom, start, end, strand, genome_seq, genome_len, circular=True):
        if start == 0 and end > start:        # custom line added
            start += 1                        # custom line added
            end += 1                          # custom line added
        if start <= 0 or end <= 0 or start > genome_len or end > genome_len: 
            raise ValueError("GRange: start or end position is out of the range.")
[...] # more code

It solves the problem for the type of hmm hit that is appearing in this contigs. However, I don't know if the other hits that are not place as this one are being affected or not by the biopython index manner. Neither if this is the best solution.

I am sharing you the code that allowed me to at least run the program with this contigs, but I don't know if it is the best solution.

Hope it helps 😄

fmalmeida commented 2 years ago

Hi @janka2012 ,

Another contig have flagged the same issue even after my custom fix. I will make them available here:

>k141_1155 flag=0 multi=1.7411 len=477
TTGCCGTTTCAATATCCCAAAAACAGCCACTTTTCCTGCTGCTCCACGACCTCGTTTTCCTTTACGAGGTCCACCAAAATAGCTTTCATCCAACTCAATTTGTCCGTCAAATACTTCATCTGCTTCAAGGGCTAAATGATAACTGATGACCTCACGGATTTTTCGGTAAAACAGGATCGCTGAATTGGGCTGAATACCTAGCAAATCTGCGACTGATCGAGCTGTTACTTCTAACACAAAAAATTCAAGGAGTTTCTTCTGTATAGATTTCTTTAATTTACAATGAGTTATCTTCATTTTTGTAGGTTAGCATACTAGCTAATCTACGTCAGCCCCTAAATCTTTCATCAAACCACGTCTACCATCACAGGTAATCGATTGAATTATATAGCCTTTTTCTCTTAACCTATTCAGGGCAAGTTTGTAGTAAATATCTTTTTCGGTTCGCACAAAGTAATGTGAAATCACATTATTTGA

I've figured that for this contig, the program was "loading" the start position as a negative value (-1 to be more precise).