trinityrnaseq / trinity_community_codebase

Software contributed by Trinity community users to facilitate studies of Trinity-assembled transcriptomes.

Trinity best transcript set - fastaselecth duplicate names in input headers #3

Closed shzadiqbal closed 5 years ago

shzadiqbal commented 5 years ago

Hi, I am facing an error while running this pipeline: https://github.com/trinityrnaseq/trinity_community_codebase/wiki/Trinity-best-transcript-set


cat keep_these | fastaselecth -sel- -in merge.fasta >Transcriptome_reduced.fasta
fastaselecth: fatal error: duplicate entry names in list, alternate header terminators may be needed

Please help me to solve this. All previous steps of the pipeline ran smoothly until now.

shzadiqbal commented 5 years ago

The only thing I changed in the pipeline was to pass the gene_to_trans_map file at the TransDecoder.LongOrfs step, to keep the original Trinity IDs after TransDecoder.
This file was generated using a script from the Trinity package, "get_Trinity_gene_to_trans_map.pl". Could this be the reason duplicated headers remain at the "cat keep_these | fastaselecth -sel- -in merge.fasta >Transcriptome_reduced.fasta" step?

brianjohnhaas commented 5 years ago

Hi,

You might follow up with the developer of that utility, as it's not a core component of Trinity. Or, you might find an alternative way of accomplishing this task. There are many solutions available (google, seqanswers, the trinity forum, etc.)

best,

~b

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

mathog commented 5 years ago
fastaselecth -h 
#part of interest
   -hs STRING
   -ht STRING
         Specify an alternate set of -sel FILE delimiters.  The first delimiter
         encountered terminates the string.  Default delimiters are:
         EOL, NULL, tab, space, vertical bar, and colon.  Possible values include:
            C escape sequences: \\, \a, \b, \f, \t, \r, and \n;
            Control characters like ^C;
            Numeric character values: \###, \d###, \o###, and \x## (digital,digital,octal, and hex).
   -hi STRING
         Specify an alternate set of -input FILE delimiters.  Default is "\1 \t".
         Syntax is the same as for -hs.

The problem is usually that the headers in the input file look something like

>library_num:entry_num

where many entries have the same library_num. The select parser returns "library_num" for all of them because ':' is a delimiter. The solution in that case would be to add -ht ' \t', which makes space and tab the only delimiters. Then the identifier would be "library_num:entry_num", and that should differ for all entries (unless there are actual duplications).

In the worst case the select fasta file may have a very different header format from the file being selected from. In that case you can use perl or some other scripting language to reformat the input on the fly in a pipe. Example using drm_tools extract:

head -1 select.fasta
>random_num|library_num:entry_num:quality
head -1 input.fasta
>library_num:entry_num
#solution using extract
extract -in select.fasta -dl '|: \t' -if '^>' -ifonly \
  -fmt '>[2]:[3]' \
  | fastaselecth -sel - -in input.fasta -ht ' \t' -out output.fasta

This selects the 2nd and 3rd tokens from the select file and reformats them to match the input file. The lines which are not header lines are dropped.
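
To check whether the delimiters are what collapse the names, the truncation can be imitated outside of fastaselecth. The following is a rough sketch and not fastaselecth's actual parser: it cuts each name at the first space, tab, vertical bar, or colon (the printable default delimiters listed in the help text above) and reports names that then occur more than once. The file `demo_select.txt` and its contents are invented for illustration.

```shell
#!/bin/sh
# Hypothetical selection list; a real list will look different.
workdir=$(mktemp -d)
printf '%s\n' 'lib1:1' 'lib1:2' 'lib2:1' > "$workdir/demo_select.txt"

# Approximate the default delimiters (space, tab, '|', ':'): keep only the
# text before the first delimiter, then list names seen more than once.
dupes_default=$(awk -F'[ \t|:]' '{print $1}' "$workdir/demo_select.txt" | sort | uniq -d)

# Approximate -ht ' \t': only space and tab terminate the name.
dupes_ws=$(awk -F'[ \t]' '{print $1}' "$workdir/demo_select.txt" | sort | uniq -d)

echo "duplicates with default delimiters: ${dupes_default:-none}"
echo "duplicates with space/tab only:     ${dupes_ws:-none}"
rm -rf "$workdir"
```

If the first line reports duplicates and the second does not, restricting the delimiters is probably enough. If both report duplicates, the list itself likely contains repeated names.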

shzadiqbal commented 5 years ago

Looking at the .pep and fasta files, I noticed that the .pep file has .p1, .p2, etc. suffixes, which were removed from the "keep_these" file when the extract command was run. The reason is that TransDecoder generated multiple entries for every isoform, adding .p1, .p2, and so on. Removing the .p1 and .p2 suffixes caused multiple entries with the same header in the keep_these file, which I think is easy to fix using uniq keep_these > keep_these1. We only need one entry for each selected isoform. Please let me know if you think otherwise.

mathog commented 5 years ago

You are correct, only one entry is required at that point. The instructions have been modified to put uniq into that command. Note that the later step, which makes another "keep_these" file, retains .p1, .p2, etc., so adding uniq there should do nothing: there should not be any duplicates there.
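
The de-duplication discussed here can be sketched as follows. This is an illustration under assumptions, not the wiki's exact command: the IDs are invented, the `.pN` suffix is stripped with `sed`, and `sort -u` is used in place of plain `uniq` because `uniq` only collapses adjacent duplicate lines.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Invented TransDecoder-style peptide IDs; the two DN1 entries are not adjacent.
printf '%s\n' \
  'TRINITY_DN1_c0_g1_i1.p1' \
  'TRINITY_DN2_c0_g1_i1.p1' \
  'TRINITY_DN1_c0_g1_i1.p2' > "$workdir/pep_ids.txt"

# Strip the trailing .p1, .p2, ... suffix, then keep one line per isoform.
sed 's/\.p[0-9][0-9]*$//' "$workdir/pep_ids.txt" | sort -u > "$workdir/keep_these"

kept=$(cat "$workdir/keep_these")
n_kept=$(wc -l < "$workdir/keep_these" | tr -d ' ')
echo "kept $n_kept unique isoform IDs:"
echo "$kept"
rm -rf "$workdir"
```

Plain `uniq` gives the same result only when identical names already sit next to each other in the file, which is likely when the `.pN` entries for an isoform are written consecutively.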