mdsufz / MuDoGeR

MuDoGeR makes the recovery of genomes from prokaryotes, viruses, and eukaryotes from metagenomes easy.
GNU General Public License v3.0
87 stars 9 forks source link

Virfinder AWK Filtering Maybe not Correct #27

Open benyoung93 opened 3 months ago

benyoung93 commented 3 months ago

Hi there :)

First of all, a wonderful collection of tools and a pipeline !! I have had trouble with some if the installation but I am just chopping and cutting the pipeline to fit my needs.

A quick query for the Virfinder step. So I know you want to do q-value < 0.01 and length > 1000bp. I have run the virfinder rscript file and got the output folder. I just want to double check that the filtering awk script is correct.

head P10_virfinder.tsv

"name"  "length"    "score" "pvalue"    "qvalue"
"1" "k141_270856 flag=1 multi=2.0000 len=320"   320 0.138513157806992   0.588997020854022   0.56810885275558
"2" "k141_33860 flag=1 multi=2.0000 len=361"    361 0.248927602838187   0.42085402184707    0.525288351341074
"3" "k141_237000 flag=1 multi=6.0000 len=478"   478 0.392398453460999   0.271737835153923   0.470800520085438
"4" "k141_93112 flag=1 multi=5.0000 len=419"    419 0.316555183433318   0.343932472691162   0.499862326324076
"5" "k141_177752 flag=1 multi=2.0000 len=398"   398 0.461420194784674   0.219920556107249   0.450421369675625
"6" "k141_321640 flag=1 multi=4.0000 len=383"   383 0.456895550455814   0.222641509433962   0.450905940743768
"7" "k141_211608 flag=1 multi=2.0000 len=477"   477 0.152712635250371   0.562542204568024   0.561554065996342
"8" "k141_16930 flag=1 multi=2.0000 len=426"    426 0.165548747858572   0.539602780536246   0.555654697140864
"9" "k141_8465 flag=1 multi=3.0000 len=431" 431 0.855015816724245   0.0269116186693148  0.268291393791584

The subsequent awk script is then this obviously (ignore all mu '"'"'" this is for a loop script that generates jobs for all of my samples).

cat '"$PALPAL"'_virfinder.tsv | awk -F'\t' '"'"'{ if ($4 <= 0.01) print }'"'"' | awk -F'_' '"'"'{ if ($4 >= 1000) print }'"'"' | cut -f2 | sed "s/\"//g" > '"$PALPAL"'_virfinder_filtered_data.txt

So this awk script is currently firstly using awk on the fourth \t column which is p-value ($4), should this not be $5 as the <0.01 filtering is for q-value ?

Secondly, second awk is extracting the fourth instance after underscore, looking at the output this doesn't seem to be correct. If you are wanting length >1000 would it not be a better idea to do awk -F'\t' '"'"'{ if ($2 >= 1000) print }'"'"'

Looking at the virfinder github it seems the original results format would work with your awk script, but I think a new update may of switched this ?

I used the mamba install in the mudoger install scripts but edited a wee bit (mamba create -n virfinder_env -c bioconda r-virfinder) so I believe both my local and the mudogoer install versions would be the same.

Happy to provide more info if needed, and I apologise in advance if I am barking up the wrong tree, but I was doing some QC and testing and noticed that the awk was not approproate for the output file generated by virfinder :).

Ben

benyoung93 commented 3 months ago

Also, additional thing I have just realised, you need to identify your column and + 1 as the awk takes into account the rowname column in the output from write.table. It may be prudent therefore to put a row.names=F in the r script, or for the awk it would need to be $6 =<0.01 and $3 =>100 for q-value and length respectively