mihinduk closed 4 years ago
Hi Kathie,
My apologies for the slow response. I've been on vacation for the past week.
I think I understand your issue, and Cenote-Taker2 should be reporting the contigs in 'non_viral_domains_contigs.fna' exactly as you are requesting it. One consideration is that it will only retain header information before the first whitespace character. Is it possible that you included a whitespace directly after the '>' character?
If possible, please create a new .fasta file containing only "contig_1689", formatted like your original .fasta, and attach it. I'll see if I can recapitulate the error.
Best,
Mike
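P.S. Here's an example from the test contigs.
Original .fasta input: [image] https://user-images.githubusercontent.com/37546741/86260284-0f507200-bb8b-11ea-9b8c-53f4f798145c.png
From non_viral_domains_contigs.fna: [image] https://user-images.githubusercontent.com/37546741/86260373-3018c780-bb8b-11ea-9e5a-b2d019d27fc7.png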
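To illustrate the header behavior described above, here is a minimal sketch. It is not Cenote-Taker2's own code; it only mimics "keep everything before the first whitespace" with awk. The file name example.fasta and the sequences are invented for illustration.

```shell
#!/bin/bash
# Hypothetical demo file; names and sequences are made up for illustration.
workdir=$(mktemp -d)
printf '>KM_ct2761 contig_1689\nACGT\n>contig_191\nGGCC\n' > "$workdir/example.fasta"

# Keep only the first whitespace-delimited token of each header line.
kept=$(awk '/^>/ {print $1}' "$workdir/example.fasta")
echo "$kept"

# Count headers with whitespace directly after '>'; such headers would
# have nothing left before the first whitespace character.
bad=$(grep -c '^>[[:space:]]' "$workdir/example.fasta" || true)
echo "headers starting with whitespace: $bad"
```

In this sketch the header '>KM_ct2761 contig_1689' is reduced to '>KM_ct2761', which is the loss of information the thread goes on to discuss.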
Hi Mike,
I hope you had a nice vacation. Thanks for getting back to me. I made 2 files:
contig_1689.fasta = the input I used, but just for contig 1689
contig_1689_non_viral_domains_contigs.fna = the output for contig 1689 only that I got from the command:
conda activate /mnt/pathogen1/rrodgers/miniconda2/envs/cenote-taker2_env
MIN=1000
nohup python /mnt/pathogen1/rrodgers/Cenote-Taker2/run_cenote-taker2.0.1.py \
    --contigs /mnt/pathogen1/kathiem/2020_03_20_IBS_virome/KM_ct2/other_contigs/non_viral_domains_contigs.fna \
    --run_title KM_ct2_RNA \
    --template_file ../template.sbt \
    --mem 80 --cpu 20 \
    --virus_domain_db rna_virus \
    --prune_prophage FALSE \
    --filter_out_plasmids FALSE \
    --minimum_length_circular $MIN \
    --minimum_length_linear $MIN \
    --hhsuite_tool hhsearch \
    --handle_contigs_without_hallmark sketch_all > out.log 2>&1 &
Thank you for your help, Kathie
From: Mike Tisza notifications@github.com Sent: Wednesday, July 1, 2020 10:09 AM To: mtisza1/Cenote-Taker2 Cenote-Taker2@noreply.github.com Cc: Mihindukulasuriya, Kathie mihindu@wustl.edu; Author author@noreply.github.com Subject: Re: [mtisza1/Cenote-Taker2] Linking rna_virus search to original contig (#2)
Kathie,
I'm not seeing the files you referred to. Could you try attaching them once more or send them to my email at michael.tisza@gmail.com?
Mike
OK I got your files, and they look as I expected. I think I misunderstood your question.
I thought you just wanted the string 'contig_1689' to be present in the output files, which it is in the various .fna, .fsa, .gbf, and .tsv files. Were you hoping that the files produced from the analysis of contig_1689 would contain the string 'contig_1689' in the file name? This would cause some issues for me and I wouldn't really want to do it. That said, you could write a bash script that renames the files after the run is over, such as the script below. Let me know if I'm understanding you correctly, and, if I'm not, please be more explicit about what you'd like from the output.
#!/bin/bash
# Prefix each output file with the original contig name, which is read
# from the "note=" field on the first line of the matching .fsa file.
for FSA in *.fsa ; do
    # Skip the literal pattern if no .fsa files are present
    [ -e "$FSA" ] || continue
    ORIGINAL_TITLE=$( head -n1 "$FSA" | sed 's/.*note= \(.*\) ; .*/\1/' )
    BASE=${FSA%.fsa}
    echo "${BASE} to ${ORIGINAL_TITLE}"
    # Rename the .fsa file and its companion files
    for EXT in fsa gbf cmt tbl val sqn ; do
        mv "${BASE}.${EXT}" "${ORIGINAL_TITLE}_${BASE}.${EXT}"
    done
done
Hi Mike,
What I was hoping to capture was the original contig name in the output of the rna_virus search. So, when I submit my original file for the initial search:
MIN=1000
python /mnt/pathogen1/rrodgers/Cenote-Taker2/run_cenote-taker2.0.1.py \
    --contigs /mnt/pathogen1/kathiem/2020_03_20_IBS_virome/assembly/contig_dictionary/contig_dictionary.fasta \
    --run_title KM_ct2 \
    --template_file ../template.sbt \
    --mem 80 --cpu 20 \
    --prune_prophage FALSE \
    --filter_out_plasmids FALSE \
    --minimum_length_circular $MIN \
    --minimum_length_linear $MIN \
    --hhsuite_tool hhsearch \
    --handle_contigs_without_hallmark sketch_all
My infile consists of named contigs, for example:
contig_1689
contig_191
contig_2107
and the outfile has the column "Cenote-taker contig name":
ct2_contig_dictionary1006
ct2_contig_dictionary1525
ct2_contig_dictionary1526
and the column "original contig name", which contains the contig names:
contig_191
contig_2385
contig_2386
The input fasta for the --virus_domain_db rna_virus run contains both these names, separated by a space:
KM_ct2761 contig_1689
KM_ct21225 contig_2107
KM_ct21484 contig_2347
KM_ct2188 contig_492
The output file from the --virus_domain_db rna_virus run has the column "Cenote-taker contig name":
KM_ct2_RNA1551_vs01
KM_ct2_RNA198_vs01
KM_ct2_RNA432_vs01
KM_ct2_RNA681_vs01
and the column "original contig name", which contains the contig names:
KM_ct2761
KM_ct21225
KM_ct21484
KM_ct2188
I would like the "original contig name" column to contain either:
KM_ct2761 contig_1689
OR
contig_1689
as I now have an added parsing step, where I have to go back to the input fasta for the --virus_domain_db rna_virus run (non_viral_domains_contigs.fna) and search for the "original contig name" to link the original contig name in my 1st fasta infile (contig_dictionary.fasta), which is how I have the contigs linked to my metadata. I was hoping to avoid the extra parsing step and be able to read the RNA output into R to link it with my metadata for analysis.
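The extra parsing step described here could be reduced to building a lookup table once from the FASTA headers. The following is a minimal sketch, assuming headers of the form '>KM_ct2761 contig_1689' as in the example above; the output name name_map.tsv is hypothetical and the sequences are invented.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Stand-in for non_viral_domains_contigs.fna, with made-up sequences.
printf '>KM_ct2761 contig_1689\nACGT\n>KM_ct21225 contig_2107\nTTGA\n' \
    > "$workdir/non_viral_domains_contigs.fna"

# For every header line, write "first-run name <TAB> original contig name".
awk '/^>/ { sub(/^>/, "", $1); print $1 "\t" $2 }' \
    "$workdir/non_viral_domains_contigs.fna" > "$workdir/name_map.tsv"

cat "$workdir/name_map.tsv"
```

A two-column table like this could then be read into R alongside the RNA output and joined on the first-run name, though it remains an additional step outside Cenote-Taker2.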
Thank you, Kathie
From: Mike Tisza notifications@github.com Sent: Wednesday, July 1, 2020 2:52 PM To: mtisza1/Cenote-Taker2 Cenote-Taker2@noreply.github.com Cc: Mihindukulasuriya, Kathie mihindu@wustl.edu; Author author@noreply.github.com Subject: Re: [mtisza1/Cenote-Taker2] Linking rna_virus search to original contig (#2)
OK so I was clearly misunderstanding your question before. I'm sorry about that.
There are a variety of reasons that I'm not willing to change the pipeline to preserve the fasta header information beyond the first whitespace character, so, unfortunately, you'll have to parse the 'non_viral_domains_contigs.fna' file before feeding it to another cenote-taker2 run. A quick manipulation would be to remove the space character in (from your example) 'KM_ct2761 contig_1689'. For example you could change it to an '@' character (KM_ct2761@contig_1689), so the information would all be preserved:
sed 's/ /@/g' non_viral_domains_contigs.fna > contigs_for_next_run.fna
I don't think this was the answer you were hoping for, but I'm hoping it won't be too onerous either.
Best,
Mike
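The sed command in the message above replaces every space in the file. FASTA sequence lines normally contain no spaces, so the result should be the same, but a variant restricted to header lines makes the intent explicit. This is a minimal sketch on invented data, not something taken from the Cenote-Taker2 documentation.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Stand-in input with a made-up sequence.
printf '>KM_ct2761 contig_1689\nACGT\n' > "$workdir/non_viral_domains_contigs.fna"

# Replace spaces with '@' only on lines that start with '>'.
sed '/^>/ s/ /@/g' "$workdir/non_viral_domains_contigs.fna" \
    > "$workdir/contigs_for_next_run.fna"

head -n1 "$workdir/contigs_for_next_run.fna"
```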
Hi Mike,
Thanks for the reply. Now that I know the format of the output, you are right, the simplest fix on my end will be to replace the space with a character before submitting it, so I can parse it when I get the results.
Kathie
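Once the run finishes, the joined name can be split back into its two parts. The following is a minimal sketch, assuming a tab-separated summary whose second column holds a value such as KM_ct2761@contig_1689. The file name summary.tsv, its two-column layout, and the pairing of names in the sample row are hypothetical.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Invented table: Cenote-taker contig name, original contig name.
printf 'KM_ct2_RNA1551_vs01\tKM_ct2761@contig_1689\n' > "$workdir/summary.tsv"

# Split column 2 on '@' into the first-run name and the original contig name.
# If no '@' is present, the third output column is left empty.
awk -F'\t' -v OFS='\t' '{
    n = split($2, parts, "@")
    print $1, parts[1], (n > 1 ? parts[2] : "")
}' "$workdir/summary.tsv" > "$workdir/summary_split.tsv"

cat "$workdir/summary_split.tsv"
```

The third column then carries the name from the original contig dictionary, which is the key used to link contigs to metadata.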
From: Mike Tisza notifications@github.com Sent: Thursday, July 2, 2020 10:18 AM To: mtisza1/Cenote-Taker2 Cenote-Taker2@noreply.github.com Cc: Mihindukulasuriya, Kathie mihindu@wustl.edu; Author author@noreply.github.com Subject: Re: [mtisza1/Cenote-Taker2] Linking rna_virus search to original contig (#2)
OK, great. And, please make me aware of any additional issues you encounter.
I am now closing this issue.
Hi,
Is it possible to have the original contig name in the output include the information after the space so that the name of the contig from the original contig dictionary would be captured?
Thank you, Kathie Mihindukulasuriya