mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License

Linking rna_virus search to original contig #2

Closed mihinduk closed 4 years ago

mihinduk commented 4 years ago

Hi,

Is it possible to have the original contig name in the output include the information after the space so that the name of the contig from the original contig dictionary would be captured?

KM_ct2761 contig_1689 (from non_viral_domains_contigs.fna) would be reported as KM_ct2761 contig_1689, not just KM_ct2761. This would save a lookup step.

Thank you, Kathie Mihindukulasuriya

mtisza1 commented 4 years ago

Hi Kathie,

My apologies for the slow response. I've been on vacation for the past week.

I think I understand your issue, and Cenote-Taker2 should be reporting the contigs in 'non_viral_domains_contigs.fna' exactly as you are requesting it. One consideration is that it will only retain header information before the first whitespace character. Is it possible that you included a whitespace directly after the '>' character?
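A quick way to check a FASTA file for this situation is sketched below. It is only an illustration: the file name and sequences are made up, and the truncation step simply mimics the "keep everything before the first whitespace" behavior described above rather than calling any Cenote-Taker2 code.

```shell
#!/bin/bash
# Sketch only: example.fasta and its records are made-up illustration data.
WORKDIR=$( mktemp -d )
FASTA="${WORKDIR}/example.fasta"
printf '>KM_ct2761 contig_1689\nACGTACGT\n> contig_191 extra\nTTGACA\n' > "${FASTA}"

# Count header lines that have whitespace directly after '>'.
BAD_HEADERS=$( grep -c '^>[[:space:]]' "${FASTA}" || true )
echo "headers with whitespace after '>': ${BAD_HEADERS}"

# Show what is left of each header if only the text before the first
# whitespace character is kept. Brackets make an empty name visible.
KEPT_IDS=$( grep '^>' "${FASTA}" | sed 's/^>//; s/[[:space:]].*//; s/.*/[&]/' )
echo "${KEPT_IDS}"
```

With the sample data, the second header begins with a space, so nothing is left of its name once the header is cut at the first whitespace character.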

If possible, please create a new .fasta file containing just "contig_1689", formatted like your original .fasta, and attach it. I'll see if I can reproduce the error.

Best,

Mike

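P.S. Here's an example from the test contigs.

Original .fasta input: [image] https://user-images.githubusercontent.com/37546741/86260284-0f507200-bb8b-11ea-9b8c-53f4f798145c.png

From non_viral_domains_contigs.fna: [image] https://user-images.githubusercontent.com/37546741/86260373-3018c780-bb8b-11ea-9e5a-b2d019d27fc7.png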

mihinduk commented 4 years ago

Hi Mike,

I hope you had a nice vacation. Thanks for getting back to me. I made 2 files:

contig_1689.fasta = the input I used, but just for contig 1689
contig_1689_non_viral_domains_contigs.fna = the output for contig 1689 only, which I got from the command:

conda activate /mnt/pathogen1/rrodgers/miniconda2/envs/cenote-taker2_env

MIN=1000

nohup python /mnt/pathogen1/rrodgers/Cenote-Taker2/run_cenote-taker2.0.1.py \
    --contigs /mnt/pathogen1/kathiem/2020_03_20_IBS_virome/KM_ct2/other_contigs/non_viral_domains_contigs.fna \
    --run_title KM_ct2_RNA \
    --template_file ../template.sbt \
    --mem 80 --cpu 20 \
    --virus_domain_db rna_virus \
    --prune_prophage FALSE \
    --filter_out_plasmids FALSE \
    --minimum_length_circular $MIN \
    --minimum_length_linear $MIN \
    --hhsuite_tool hhsearch \
    --handle_contigs_without_hallmark sketch_all > out.log 2>&1 &

Thank you for your help, Kathie




The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.

mtisza1 commented 4 years ago

Kathie,

I'm not seeing the files you referred to. Could you try attaching them once more or send them to my email at michael.tisza@gmail.com?

Mike

mtisza1 commented 4 years ago

OK I got your files, and they look as I expected. I think I misunderstood your question.

I thought you just wanted the string 'contig_1689' to be present in the output files, which it is in the various .fna, .fsa, .gbf, and .tsv files. Were you hoping that the files produced from the analysis of contig_1689 would contain the string 'contig_1689' in the file name? This would cause some issues for me and I wouldn't really want to do it. That said, you could write a bash script that renames the files after the run is over, such as the script below. Let me know if I'm understanding you correctly, and, if I'm not, please be more explicit about what you'd like from the output.



#!/bin/bash
# Prefix each output file with the original contig name stored in the
# "note=" field of the first line of the matching .fsa file.

for FSA in *.fsa ; do
    # Skip the literal pattern when no .fsa files are present
    [ -e "$FSA" ] || continue
    ORIGINAL_TITLE=$( head -n1 "$FSA" | sed 's/.*note= \(.*\) ; .*/\1/' )
    BASE=${FSA%.fsa}
    echo "${BASE} to ${ORIGINAL_TITLE}"
    for EXT in fsa gbf cmt tbl val sqn ; do
        # Rename only the companion files that actually exist
        if [ -e "${BASE}.${EXT}" ] ; then
            mv "${BASE}.${EXT}" "${ORIGINAL_TITLE}_${BASE}.${EXT}"
        fi
    done
done

mihinduk commented 4 years ago

Hi Mike,

What I was hoping to capture was the original contig name in the output of the rna_virus search. So, when I submit my original file for the initial search:

MIN=1000

python /mnt/pathogen1/rrodgers/Cenote-Taker2/run_cenote-taker2.0.1.py \
    --contigs /mnt/pathogen1/kathiem/2020_03_20_IBS_virome/assembly/contig_dictionary/contig_dictionary.fasta \
    --run_title KM_ct2 \
    --template_file ../template.sbt \
    --mem 80 --cpu 20 \
    --prune_prophage FALSE \
    --filter_out_plasmids FALSE \
    --minimum_length_circular $MIN \
    --minimum_length_linear $MIN \
    --hhsuite_tool hhsearch \
    --handle_contigs_without_hallmark sketch_all

My infile consists of named contigs, for example:

contig_1689
contig_191
contig_2107

and the outfile has the column "Cenote-taker contig name":

ct2_contig_dictionary1006
ct2_contig_dictionary1525
ct2_contig_dictionary1526

and the column "original contig name", which contains the contig names:

contig_191
contig_2385
contig_2386

The input fasta for the --virus_domain_db rna_virus run contains both these names, separated by a space:

KM_ct2761 contig_1689
KM_ct21225 contig_2107
KM_ct21484 contig_2347
KM_ct2188 contig_492

The output file from the --virus_domain_db rna_virus run has the column "Cenote-taker contig name":

KM_ct2_RNA1551_vs01
KM_ct2_RNA198_vs01
KM_ct2_RNA432_vs01
KM_ct2_RNA681_vs01

and the column "original contig name", which contains the contig names:

KM_ct2761
KM_ct21225
KM_ct21484
KM_ct2188

I would like the "original contig name" column to contain either:

KM_ct2761 contig_1689

OR

contig_1689

As it stands, I have an added parsing step: I have to go back to the input fasta for the --virus_domain_db rna_virus run (non_viral_domains_contigs.fna) and search for the "original contig name" in order to recover the contig name from my first fasta infile (contig_dictionary.fasta), which is how my contigs are linked to my metadata. I was hoping to avoid this extra step and read the RNA output directly into R to link it with my metadata for analysis.

Thank you, Kathie
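The extra lookup step described in this comment could be scripted roughly as follows. This is a sketch under assumptions: the FASTA records are illustration data modeled on the header examples above, and name_lookup.tsv is a hypothetical file name, not something Cenote-Taker2 produces.

```shell
#!/bin/bash
# Sketch only: the records below are illustration data modeled on the
# header examples in this thread.
WORKDIR=$( mktemp -d )
FASTA="${WORKDIR}/non_viral_domains_contigs.fna"
printf '>KM_ct2761 contig_1689\nACGT\n>KM_ct21225 contig_2107\nGGCC\n' > "${FASTA}"

# Build a two-column, tab-separated lookup table from the header lines:
# column 1 = first-round name, column 2 = original contig name.
LOOKUP="${WORKDIR}/name_lookup.tsv"   # hypothetical file name
grep '^>' "${FASTA}" | sed 's/^>//' | awk '{ print $1 "\t" $2 }' > "${LOOKUP}"
cat "${LOOKUP}"

# Resolve one name reported in the "original contig name" column.
ORIGINAL=$( awk -F'\t' -v id="KM_ct2761" '$1 == id { print $2 }' "${LOOKUP}" )
echo "KM_ct2761 came from ${ORIGINAL}"
```

A table like this can be read into R and merged with the RNA output on the first column, which keeps the lookup to a single join.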





mtisza1 commented 4 years ago

OK so I was clearly misunderstanding your question before. I'm sorry about that.

There are a variety of reasons that I'm not willing to change the pipeline to preserve the fasta header information beyond the first whitespace character, so, unfortunately, you'll have to edit the 'non_viral_domains_contigs.fna' file before feeding it to another Cenote-Taker2 run. A quick fix would be to replace the space character in (from your example) 'KM_ct2761 contig_1689' with another character. For example, you could change it to an '@' character (KM_ct2761@contig_1689), so that all of the information is preserved:

sed 's/ /@/g' non_viral_domains_contigs.fna > contigs_for_next_run.fna
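A slightly more defensive variant, shown here only as a sketch, restricts the substitution to header lines and then checks that no header still contains whitespace. The sample record and the temporary directory are illustration data, not part of the pipeline.

```shell
#!/bin/bash
# Sketch only: the sample record is illustration data from this thread.
WORKDIR=$( mktemp -d )
IN="${WORKDIR}/non_viral_domains_contigs.fna"
OUT="${WORKDIR}/contigs_for_next_run.fna"
printf '>KM_ct2761 contig_1689\nACGTACGT\n' > "${IN}"

# Replace spaces with '@' on header lines only (lines starting with '>').
sed '/^>/ s/ /@/g' "${IN}" > "${OUT}"
NEW_HEADER=$( head -n1 "${OUT}" )
echo "${NEW_HEADER}"

# Confirm that no header line still contains whitespace.
REMAINING=$( grep -c '^>.*[[:space:]]' "${OUT}" || true )
echo "headers still containing whitespace: ${REMAINING}"
```

Limiting the edit to lines that start with '>' leaves the sequence lines untouched even if they happen to contain spaces.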

I don't think this was the answer you were hoping for, but I'm hoping it won't be too onerous either.

Best,

Mike

mihinduk commented 4 years ago

Hi Mike,

Thanks for the reply. Now that I know the format of the output, you are right, the simplest fix on my end will be to replace the space with a character before submitting it, so I can parse it when I get the results.

Kathie
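Once the run has finished, the combined name can be split back into its two parts. The sketch below assumes a tab-separated summary file with the two columns named in this thread; the file names, the exact layout, and the pairing of the two sample values are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: summary_example.tsv is a hypothetical stand-in for the
# summary table, and the row pairs two names from this thread for illustration.
WORKDIR=$( mktemp -d )
SUMMARY="${WORKDIR}/summary_example.tsv"
SPLIT="${WORKDIR}/summary_split.tsv"
printf 'Cenote-taker contig name\toriginal contig name\n' > "${SUMMARY}"
printf 'KM_ct2_RNA1551_vs01\tKM_ct2761@contig_1689\n' >> "${SUMMARY}"

# Split the second column at '@' into two separate columns.
awk -F'\t' -v OFS='\t' '
    NR == 1 { print $1, "first-round name", "original contig name"; next }
    { split($2, parts, "@"); print $1, parts[1], parts[2] }
' "${SUMMARY}" > "${SPLIT}"

LAST_ROW=$( tail -n1 "${SPLIT}" )
echo "${LAST_ROW}"
```

The split table then carries the original contig name in its own column, ready to be joined with the metadata.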





mtisza1 commented 4 years ago

OK, great. Please let me know about any additional issues you encounter.

I am now closing this issue.