jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

[Request] Accessory script to convert `final-viral-boundary.tsv` to gff file #70

Open jolespin opened 3 years ago

jolespin commented 3 years ago

I need to count the reads mapped to my viruses found via VirSorter2. However, I've already mapped to the contigs and I would like to use featureCounts to pull out the reads mapped to the viruses (not the flanking regions of the contigs).

Is it possible to produce an accessory script that converts the final-viral-boundary.tsv to a gff3 or gtf file (please not genbank)?

In the meantime, I can try to work on one (I don't have any prophage). Which fields will I use for the boundaries?

```
(base) -bash-4.2$ python -c "import pandas as pd; print(*pd.read_csv('virsorter_output/final-viral-boundary.tsv',index_col=0, sep='\t').columns, sep='\n')"
trim_orf_index_start
trim_orf_index_end
trim_bp_start
trim_bp_end
trim_pr
trim_pr_max
prox_orf_index_start
prox_orf_index_end
prox_bp_start
prox_bp_end
prox_pr
prox_pr_max
partial
full_orf_index_start
full_orf_index_end
full_bp_start
full_bp_end
pr_full
arc
bac
euk
vir
mix
unaligned
hallmark_cnt
group
shape
seqname_new
```

jiarong commented 3 years ago

Hi, `trim_bp_start` and `trim_bp_end` are the boundaries used for the final viral contigs. Be aware that host-region trimming in VS2 is conservative, so some host sequence may remain. You can use specialized prophage extraction tools to clean it up.
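
Until an official accessory script exists, a conversion along these lines might work. This is only a minimal sketch, not part of VirSorter2: the function name `boundary_to_gff3`, the `viral_region` feature type, and the `VirSorter2` source label are made up for illustration. It assumes the first column of `final-viral-boundary.tsv` holds the original contig ID (the column read with `index_col=0` above), that `trim_bp_start`/`trim_bp_end` are 1-based inclusive coordinates as GFF3 expects, and that the IDs contain no GFF3-reserved characters such as `;` or `=`. All three assumptions should be checked against real output before trusting the counts.

```python
import csv
import io


def boundary_to_gff3(tsv_handle, gff_handle, source="VirSorter2",
                     feature_type="viral_region"):
    """Write one GFF3 feature per row of a final-viral-boundary.tsv-style table.

    Hypothetical helper: the feature type and source label are illustrative.
    Returns the number of features written.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    # Assumption: the first column is the contig ID used in the mapping reference.
    seqid_field = reader.fieldnames[0]
    gff_handle.write("##gff-version 3\n")
    count = 0
    for row in reader:
        # int(float(...)) tolerates values written as "101" or "101.0".
        start = int(float(row["trim_bp_start"]))
        end = int(float(row["trim_bp_end"]))
        # Prefer seqname_new as the feature ID; fall back to contig ID plus a counter.
        feature_id = row.get("seqname_new") or f"{row[seqid_field]}_{count + 1}"
        fields = [
            row[seqid_field],  # seqid
            source,            # source
            feature_type,      # type
            str(start),        # start (assumed 1-based, inclusive)
            str(end),          # end
            ".",               # score
            ".",               # strand: the trimmed region itself has none
            ".",               # phase
            f"ID={feature_id}",
        ]
        gff_handle.write("\t".join(fields) + "\n")
        count += 1
    return count


# Made-up two-line table standing in for final-viral-boundary.tsv.
example = io.StringIO(
    "seqname\ttrim_bp_start\ttrim_bp_end\tseqname_new\n"
    "contig_1\t101\t5000\tcontig_1||0_partial\n"
)
out = io.StringIO()
boundary_to_gff3(example, out)
print(out.getvalue())
```

For a real run, the two `StringIO` objects would be replaced with open file handles for `virsorter_output/final-viral-boundary.tsv` and the output `.gff3`. The feature type and the `ID` attribute are then what a counting tool would be pointed at, for example through featureCounts' `-t` and `-g` options.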