sestaton / HMMER2GO

Annotate DNA sequences for Gene Ontology terms

Error with Emboss identifiers #7

Closed jirivorel closed 8 years ago

jirivorel commented 8 years ago

Hi Evan, I want to use HMMER2GO. It's a great tool for my purpose, but after running this simple command ...

$ hmmer2go getorf -i Data/nk_seq -o prot_seq_trans -l 90

I get an error message ...

ERROR: Identifiers such as 'Locus_1_Transcript_1/1_Confidence_1.000_Length_826' will produce unexpected renaming with EMBOSS. Exiting. at /usr/local/share/perl/5.18.2/HMMER2GO/Command/getorf.pm line 161, <$fh> chunk 2.

I am not familiar with EMBOSS, so I am not sure what is wrong with my sequence identifiers.

Thank you for your time and reply.

Jirka

sestaton commented 8 years ago

The issue is the slash in the FASTA header, which will cause EMBOSS to reformat your IDs or just halt. You can fix your DNA sequences with this script: clean_multifasta.pl. That should solve the problem for you.
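For anyone who wants to see what such a cleanup amounts to, here is a minimal shell sketch. It is not the clean_multifasta.pl script, and the function names are made up for illustration. It assumes the slash is the only offending character and swaps it for an underscore on header lines only, so it is worth checking afterwards that the renamed IDs are still unique.

```shell
# Hypothetical helpers for illustration; not part of HMMER2GO or clean_multifasta.pl.

# Count the header lines that contain a "/" (assumed here to be the only problem character).
count_headers_with_slash() {
    grep -c '^>.*/' "$1"
}

# Print the FASTA file with every "/" on header lines replaced by "_".
# Sequence lines pass through unchanged and the input file is not modified.
replace_header_slashes() {
    sed '/^>/ s|/|_|g' "$1"
}
```

Usage would look like `replace_header_slashes input.fasta > input_clean.fasta`, where both file names are placeholders.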

Let me know if you have any other questions or issues.

jirivorel commented 8 years ago

Many thanks for the prompt reply. Now it runs without the error message, but I still have some problems ...

These are a couple of problems that prevent me from getting the coding sequences as amino acids and as the equivalent nucleotides.

sestaton commented 8 years ago

Thanks for the report. The first issue is fixed on the master branch (reporting of ORF type), and I'm looking into the second issue (the number of ORFs reported).

sestaton commented 8 years ago

Okay, I have looked at the tests and I'll need to see an example to help with the last issue you mention. In the output I get, I am seeing only the longest ORF for each input sequence. For a multi-FASTA file, that still means there will be many output sequences, specifically, there will be one for every input sequence as long as it passes the thresholds.

jirivorel commented 8 years ago

OK, here is a short FASTA file with 21 nucleotide sequences (https://github.com/sestaton/HMMER2GO/files/154531/nucl_seq_input.fasta.zip). I used a single FASTA file as input. After running this command:

hmmer2go getorf -i nucl_seq_input.fasta -o prot_seq_minlen_100.fasta -l 100

I get FASTA output in amino acids with 1214 sequences - this ... (https://github.com/sestaton/HMMER2GO/files/154539/prot_seq_minlen_100.fasta.zip)

sestaton commented 8 years ago

You might want to check which EMBOSS version you are using, that is likely the difference. I am using EMBOSS:6.5.7.0. With this version, I am getting 725 sequences when reporting all ORFs, and 22 sequences when reporting only the longest using the command you showed. There are 22 sequences reported and not 21 because there are 2 ORFs of the same length and I don't want to choose one over the other since it may influence the results. I will add a warning when this case is encountered.
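The tie handling described above can be sketched in shell with awk. This is only an illustration, not the code HMMER2GO uses: `longest_orfs` is a hypothetical helper, and it assumes the ORF IDs have the form `<input ID>_<n>`, which is how EMBOSS getorf names its output records.

```shell
# Hypothetical helper for illustration; not the HMMER2GO implementation.
# Reads a FASTA file of ORFs whose IDs are assumed to look like "<input ID>_<n>"
# and prints, for each input ID, every ORF that has the maximum length.
# ORFs tied for longest are all kept, so the output can exceed the number of inputs.
longest_orfs() {
    awk '
        function flush_record() {
            if (id == "") return
            n++
            parent[n] = id
            sub(/_[0-9]+$/, "", parent[n])   # strip the ORF counter to get the input sequence ID
            header[n] = hdr
            seq[n] = s
            len[n] = length(s)
            if (len[n] > best[parent[n]]) best[parent[n]] = len[n]
        }
        /^>/ { flush_record(); hdr = $0; id = substr($1, 2); s = ""; next }
        { s = s $0 }                          # join wrapped sequence lines
        END {
            flush_record()
            for (i = 1; i <= n; i++)
                if (len[i] == best[parent[i]]) print header[i] "\n" seq[i]
        }
    ' "$1"
}
```

With 21 inputs and one input that has two ORFs tied for longest, a filter like this would print 22 records, which matches the count reported above.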

jirivorel commented 8 years ago

OK, I used version 6.6.0.0. I'll contact our server administrator with a request for the older version, then try again and let you know.

sestaton commented 8 years ago

That is interesting. Can you show the output of the command getorf -version, just to be clear? To my knowledge, version 6.5.7 is the latest and I'm not sure where version 6.6.0.0 can be found. It is not on the FTP servers.

If you download emboss-latest.tar.gz from the FTP site, it is version 6.5.7. The EMBOSS download page also suggests version 6.5.7 is the latest stable version.

jirivorel commented 8 years ago

Yes, I know the EMBOSS download page (version 6.5.7 can be downloaded there, it is true), but after running getorf -version and embossversion, the version I get is EMBOSS:6.6.0.0. The HMMER version is 3.1b2.

sestaton commented 8 years ago

Aside from the EMBOSS version question, can you confirm you only ran this command:

hmmer2go getorf -i nucl_seq_input.fasta -o prot_seq_minlen_100.fasta -l 100

When I run that I get 22 sequences.

The following command will tell you the hmmer2go version: hmmer2go --version.
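On a shared server it can also help to see which binaries are actually being picked up before comparing version strings. A small sketch, where `show_tool_paths` is a hypothetical helper built on the standard `command -v`:

```shell
# Hypothetical helper for illustration.
# For each tool name given, print the path the shell resolves it to,
# or a note if the tool is not on PATH.
show_tool_paths() {
    for tool in "$@"; do
        if path=$(command -v "$tool" 2>/dev/null); then
            printf '%s\t%s\n' "$tool" "$path"
        else
            printf '%s\tnot found on PATH\n' "$tool"
        fi
    done
}
```

For this thread that would be `show_tool_paths getorf embossversion hmmer2go`, followed by `getorf -version` and `hmmer2go --version` to get the version strings themselves.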

jirivorel commented 8 years ago

Yes, I ran this command in the same form and I get 1214 protein sequences. My hmmer2go version is 0.17.1.

sestaton commented 8 years ago

Can you show the output of this command:

grep -c ">" nucl_seq_input.fasta; hmmer2go getorf -i nucl_seq_input.fasta -o prot_seq_minlen_100.fasta -l 100; grep -c ">" prot_seq_minlen_100.fasta

Also, I would recommend updating to the latest version to fix the first issue you mentioned about reporting the ORF type.
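A per-input breakdown of the output can make a mismatch easier to read than the two totals alone. The sketch below uses hypothetical helper names and assumes getorf-style ORF IDs of the form `<input ID>_<n>`. It also anchors the pattern as `^>`, so only header lines are counted rather than any line that happens to contain ">".

```shell
# Hypothetical helpers for illustration; not part of HMMER2GO.

# Count FASTA records by counting lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# For a FASTA file of ORFs with IDs assumed to look like "<input ID>_<n>",
# print "<input ID><TAB><number of ORFs>" per input sequence, sorted by ID.
orfs_per_input() {
    awk '
        /^>/ { id = substr($1, 2); sub(/_[0-9]+$/, "", id); count[id]++ }
        END  { for (id in count) print id "\t" count[id] }
    ' "$1" | sort
}
```

If only the longest ORF is being reported, nearly every input ID should show a count of 1; counts in the dozens would suggest all ORFs are being written.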

jirivorel commented 8 years ago

Output is: 21 1214

sestaton commented 8 years ago

What operating system are you using (and version)? I wouldn't think that the EMBOSS version would influence this, but unfortunately it is hard to test without knowing where to find that version.

My only suggestion is to try the latest hmmer2go version, and once I know your OS I can try it on a cloud instance with the same set up.

jirivorel commented 8 years ago

I am using Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-32-generic x86_64). It is our computational biology server, though, so I am now waiting for our administrator to install the software I requested: the latest hmmer2go version and EMBOSS version 6.5.7.

jirivorel commented 8 years ago

So, we upgraded hmmer2go to version 0.17.2 and now it's working right, somehow. I get the right count of ORFs for both proteins and nucleotides. Thank you for your time and patience.

sestaton commented 8 years ago

That is good to hear. I'll close this issue, but don't hesitate to raise other issues if you have any questions. Thanks.

sestaton commented 6 years ago

Hi @abbyhudak,

I'm unclear on what command you have run. The Makefile.PL script just sets up the package to be tested and installed. It does not run any analysis.

If you are having issues with the hmmer2go getorf command, you may want to try the clean_multifasta.pl script listed above in this discussion. Please try that to convert your sequences to a format usable by EMBOSS.

You will also need to build/install the package to use the programs, and it is unclear to me if that was done correctly. Let me know if you have issues with the above suggestions.

Thanks.

abbyhudak commented 6 years ago

Sorry, I did not mean to say Makefile.PL; I meant clean_multifasta.pl. I tried using the clean_multifasta.pl script but I may not have used it correctly. I am not sure which lines of the code to actually run in my terminal.

sestaton commented 6 years ago

Hi @abbyhudak,

The script should work fine based on the identifier you posted. If you have a file of sequences named "trinity.fas" then you can use the script like so:

perl clean_multifasta.pl -i trinity.fas -o trinity_clean.fas

The file of transformed IDs will be in "trinity_clean.fas" which is the argument to "-o" above, the output file.

sestaton commented 6 years ago

Hi @abbyhudak,

I'm not sure what to make of your comment without more information. Please show the command and the output of the program (it should print results to the terminal). Did you check the output file?

sestaton commented 6 years ago

Hi @abbyhudak,

What you describe is unrelated to HMMER2GO or this thread, but I don't mind trying to help. Please send me an email and we can continue the discussion that way. If there is something related to the original issue in this thread we can pick up the discussion here.

Thanks, Evan

sestaton commented 6 years ago

Hi @abbyhudak,

You can email me at: evan@evanstaton.com. The question now is about running a script and it would be better to resolve that offline so others are not getting notifications for each message and so we can keep the discussion here focused on a specific issue.

Thanks, Evan

sestaton commented 6 years ago

FYI, this should not be an issue going forward. I've added a method to modify and store the identifiers (in v0.17.7) so any sequence format should work, and the original file will be untouched.
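The general idea of masking identifiers and restoring them later can be sketched as below. This is not the implementation added in v0.17.7; the function names, the `seqN` placeholder scheme, and the tab-separated map file are all assumptions made for illustration.

```shell
# Hypothetical sketch for illustration; not the HMMER2GO implementation.

# mask_fasta_ids IN OUT MAP
# Write a copy of FASTA file IN to OUT with each header replaced by a placeholder
# ID (seq1, seq2, ...), and record "placeholder<TAB>original header" lines in MAP.
# IN itself is never modified.
mask_fasta_ids() {
    awk -v map="$3" '
        /^>/ { n++; printf("seq%d\t%s\n", n, substr($0, 2)) > map; print ">seq" n; next }
        { print }
    ' "$1" > "$2"
}

# restore_fasta_ids MASKED MAP
# Print MASKED with placeholder headers swapped back to the originals stored in MAP.
# Headers that are not found in MAP are printed unchanged.
restore_fasta_ids() {
    awk -F '\t' '
        NR == FNR { orig[$1] = $2; next }    # first file: load the ID map
        /^>/ { id = substr($0, 2); print ">" (id in orig ? orig[id] : id); next }
        { print }
    ' "$2" "$1"
}
```

A real pipeline would also have to cope with whatever suffixes a downstream tool appends to the placeholder IDs; this sketch only covers the plain round trip.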

abbyhudak commented 6 years ago

Great, thanks!

Abby


abbyhudak commented 6 years ago

My command: ./clean_multifasta.pl -i Mpotamo.fasta -o Mpotamo_clean.fasta

My error:

./clean_multifasta.pl: line 7: syntax error near unexpected token `newline'
./clean_multifasta.pl: line 7: `<!DOCTYPE html>'
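The error text quotes `<!DOCTYPE html>` as line 7 of ./clean_multifasta.pl itself. One possibility consistent with that (an assumption, not something confirmed in this thread) is that the saved file holds a web page rather than the Perl source. A quick way to check, using a hypothetical helper name:

```shell
# Hypothetical helper for illustration.
# Succeeds (exit status 0) if the first 20 lines of the file contain an HTML
# doctype or an opening <html tag, which a Perl script would not be expected to have.
looks_like_html() {
    head -n 20 "$1" | grep -qi -e '<!DOCTYPE html' -e '<html'
}
```

For example, `looks_like_html clean_multifasta.pl && echo "this file looks like HTML, not Perl"`. Looking at the first few lines with `head clean_multifasta.pl` gives the same information by eye.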

sestaton commented 6 years ago

Hi @abbyhudak,

Could you create a separate issue for this topic at: https://github.com/sestaton/sesbio/issues

I'd like to keep this message board for the hmmer2go issues. Also, please show a bit of the file for testing. That message suggests there is likely something unexpected with the input.

Thanks.