The only thing I changed in the pipeline was to pass the gene_to_trans_map file at the TransDecoder.LongOrfs step, in order to keep the original Trinity IDs after TransDecoder.
This file was generated using the script from the Trinity package, "get_Trinity_gene_to_trans_map.pl". Could that be the reason there are duplicated headers remaining at the `cat keep_these | fastaselecth -sel - -in merge.fasta > Transcriptome_reduced.fasta` step?
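For reference, the two commands involved probably looked roughly like the sketch below; the file names and the support-script path are illustrative, not taken from the actual run:

```
# Build the gene-to-transcript map from the Trinity assembly
# (script lives under util/support_scripts/ in recent Trinity releases; path may vary)
$TRINITY_HOME/util/support_scripts/get_Trinity_gene_to_trans_map.pl Trinity.fasta \
    > Trinity.fasta.gene_trans_map

# Pass the map to TransDecoder.LongOrfs so the original gene/transcript IDs are carried through
TransDecoder.LongOrfs -t Trinity.fasta --gene_trans_map Trinity.fasta.gene_trans_map
```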
Hi,
You might follow up with the developer of that utility, as it's not a core component of Trinity. Or, you might find an alternative way of accomplishing this task. There are many solutions available (google, seqanswers, the trinity forum, etc.)
best,
~b
On Thu, Aug 29, 2019 at 2:28 AM shzadiqbal notifications@github.com wrote:
hi, I am facing an error while running this pipeline:

```
cat keep_these | fastaselecth -sel - -in merge.fasta > Transcriptome_reduced.fasta
fastaselecth: fatal error: duplicate entry names in list, alternate header terminators may be needed
```

Please help me solve this. All previous steps of the pipeline ran smoothly until now.
```
fastaselecth -h
# part of interest:
-hs STRING
-ht STRING
    Specify an alternate set of -sel FILE delimiters. The first delimiter
    encountered terminates the string. Default delimiters are:
    EOL, NULL, tab, space, vertical bar, and colon. Possible values include:
    C escape sequences: \\, \a, \b, \f, \t, \r, and \n;
    Control characters like ^C;
    Numeric character values: \###, \d###, \o###, and \x## (decimal, decimal, octal, and hex).
-hi STRING
    Specify an alternate set of -input FILE delimiters. Default is "\1 \t".
    Syntax is the same as for -hs.
```
The problem is usually that the headers in the input file look something like `>library_num:entry_num`, where many entries have the same library_num. The select parser returns "library_num" for all of them because ':' is a delimiter. The solution in that case is to add `-ht ' \t'`, which makes only space and tab the delimiters. The identifier is then "library_num:entry_num", and that should differ for all entries (unless there are actual duplications). In the worst case the select fasta file may have a very different header format from the file being selected. In that case you can use perl or some other scripting language to reformat the input on the fly in a pipe, as sketched after the extract example below.
Example using drm_tools extract:

```
head -1 select.fasta
>random_num|library_num:entry_num:quality

head -1 input.fasta
>library_num:entry_num

# solution using extract
extract -in select.fasta -dl '|: \t' -if '^>' -ifonly \
    -fmt '>[2]:[3]' \
  | fastaselecth -sel - -in input.fasta -ht ' \t' -out output.fasta
```

This selects the 2nd and 3rd tokens from the select file and reformats them to match the input file. Lines which are not header lines are dropped.
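If drm_tools extract is not at hand, the on-the-fly reformatting mentioned above can also be sketched with a perl one-liner; this is an untested sketch assuming the two header layouts shown in the example, with the same illustrative file names:

```
# Print only the header lines of select.fasta, rewritten from
# ">random_num|library_num:entry_num:quality" to ">library_num:entry_num",
# then feed them to fastaselecth as the selection list on stdin
perl -ne 'print ">$1:$2\n" if /^>[^|]*\|([^:]+):([^:]+)/' select.fasta \
  | fastaselecth -sel - -in input.fasta -ht ' \t' -out output.fasta
```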
Looking at the .pep and fasta files, I noticed that the .pep file has .p1, .p2, etc. suffixes, which are removed when the extract command is run to build the keep_these file. The reason is that TransDecoder has generated multiple entries per isoform, adding .p1, .p2, and so on; removing the .p1/.p2 suffix therefore leaves multiple entries with the same header in keep_these. This is easy to fix with

```
uniq keep_these > keep_these1
```

We only need one entry for each selected isoform.
Please let me know if you think otherwise.
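A minimal sketch of that deduplication, using the file names above. Plain uniq only collapses adjacent duplicates, so sorting first is the safer general form:

```
# Deduplicate the ID list; sort first so uniq also catches
# duplicates that are not adjacent in keep_these
sort keep_these | uniq > keep_these1
# equivalent shorthand:
sort -u keep_these > keep_these1
```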
You are correct, only one entry is required at that point. The instructions have been modified to put uniq into that command. Note that the later step which makes another "keep_these" file retains .p1, .p2, etc., so adding uniq there should do nothing; there should not be any duplicates there.
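A quick, purely illustrative check of that last point: with the .p1/.p2 suffixes retained, the IDs differ, so uniq passes them all through (the ID below is a made-up Trinity-style name):

```
printf '>TRINITY_DN0_c0_g1_i1.p1\n>TRINITY_DN0_c0_g1_i1.p2\n' | uniq | wc -l
# prints 2: both lines survive because the suffixed IDs are distinct
```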
hi, I am facing an error while running this pipeline: https://github.com/trinityrnaseq/trinity_community_codebase/wiki/Trinity-best-transcript-set