shiraz-shah / VFCs

Code for de novo discovery of viral families in virome data
13 stars 2 forks source link

'awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted' During orthology-filter step #1

Open haleyhallowell opened 1 year ago

haleyhallowell commented 1 year ago

Hello! I am currently trialling your code for clustering viral species. However, I am repeatedly hitting an error during the orthology-filter step. I have isolated out the portion of the command that is throwing the error here:

cat vOTUs.fasta36 | ./joincol vOTUs.faa.lengths | ./joincol vOTUs.faa.lengths 2 | awk '{print $1 "\t" $2 "\t" $11 "\t" $13/$14 "\t" ($8-$7)/(2*$13)+($10-$9)/(2*$14) "\t" ($7+$8-$9-$10)/($13+$14)}' | awk '{if ($3 <= 0.05) print}' | awk '{if ($5 >= 0.4) print}' | awk '{if (sqrt(($4-1)^2) - (sqrt(sqrt($5))-.8) + sqrt($6^2) <= 0.1) print $1 "\t" $2}' > test.tsv

And get the following error:

awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted

I think my .faa.lengths file looks the way that it should (example snippet here):

k127_510701||0_partial_1 # 1 # 1107 # -1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGxGG;rbs_spacer=5- 10bp;gc_cont=0.563 369 k127_510701||0_partial_2 # 1107 # 2582 # -1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGxGG;rbs_spacer=5-10bp;gc_cont=0.539 492 k127_510701||0_partial_3 # 2735 # 3949 # -1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.477 405 k127_510701||0_partial_4 # 4254 # 4934 # -1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.492 227 k127_510701||0_partial_5 # 4931 # 5338 # -1 # ID=1_5;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.525 136 k127_510701||0_partial_6 # 5354 # 5650 # -1 # ID=1_6;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.384 99 k127_510701||0_partial_7 # 5647 # 5937 # -1 # ID=1_7;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.450 97

Any insight on how to fix this? Thanks!!

shiraz-shah commented 1 year ago

Hey Haley, Could you try to trim down the headers in the protein fasta file so they don’t have all that extra prodigal info? Then recompute the lengths file. Then proceed to the orthology step.

So e.g. the first protein’d ID should be only k127_510701||0_partial_1 and so on. Fasta good already trained that part of you look at the tabular output, but since it’s there in your lengths file, joincol can’t match up the columns.

Thanks for pointing this out. I’ll try to fix it, but in the meantime the above hack should do the trick.

On Sun, 2 Jul 2023 at 04.52, haleyhallowell @.***> wrote:

Hello! I am currently trialling your code for clustering viral species. However, I am repeatedly hitting an error during the orthology-filter step. I have isolated out the portion of the command that is throwing the error here:

cat vOTUs.fasta36 | ./joincol vOTUs.faa.lengths | ./joincol vOTUs.faa.lengths 2 | awk '{print $1 "\t" $2 "\t" $11 "\t" $13/$14 "\t" ($8-$7)/(2$13)+($10-$9)/(2$14) "\t" ($7+$8-$9-$10)/($13+$14)}' | awk '{if ($3 <= 0.05) print}' | awk '{if ($5 >= 0.4) print}' | awk '{if (sqrt(($4-1)^2) - (sqrt(sqrt($5))-.8) + sqrt($6^2) <= 0.1) print $1 "\t" $2}' > test.tsv

And get the following error:

awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted

I think my .faa.lengths file looks the way that it should (example snippet here):

k127_510701||0_partial_1 # 1 # 1107 # -1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGxGG;rbs_spacer=5- 10bp;gc_cont=0.563 369 k127_510701||0_partial_2 # 1107 # 2582 # -1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGxGG;rbs_spacer=5-10bp;gc_cont=0.539 492 k127_510701||0_partial_3 # 2735 # 3949 # -1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.477 405 k127_510701||0_partial_4 # 4254 # 4934 # -1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.492 227 k127_510701||0_partial_5 # 4931 # 5338 # -1 # ID=1_5;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.525 136 k127_510701||0_partial_6 # 5354 # 5650 # -1 # ID=1_6;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.384 99 k127_510701||0_partial_7 # 5647 # 5937 # -1 # ID=1_7;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.450 97

Any insight on how to fix this? Thanks!!

— Reply to this email directly, view it on GitHub https://github.com/shiraz-shah/VFCs/issues/1, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOJORBBSE5R2VRTRP3FRGGLXODO6BANCNFSM6AAAAAAZ3FDRLQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Shiraz A. Shah, MSc, PhD Senior Researcher C‌openhagen Prospective Studies on Asthma in Childhood H‌erlev and G‌entofte H‌ospital, U‌niversity of C‌openhagen www.copsac.com

haleyhallowell commented 1 year ago

HI! Thanks for your quick reply!

I edited the original .faa file to look like this before fasta36


>k127_510701||0_partial_1
FLGKKMPEGTAVEWCRTLLGPLFGNLVLLGLGAKLFCDGALELRIFSEIIRRTMLPATPV
WVLTLVVLLIAGALAAQGMECRGRTAEILFFLVSVPLLVVLLAVAFSAEYGRVLPLEPPA
AEGWKDAFASLGVLFQGLAFLYFIIPDLKKPQKMQRAVLAAGVHAALLFATLTFLSLAVY
GEVMLKQKLLPTLQMLERVSFTGIFLTRQDVLLLWFWIASVSVFLSGTLFFSSLMGVRMA
GQTEAKRKNWLYAALAVLFLLSFLPEDMAQAHEFRLRASLWLNGFYLLVLPALLLFLAAV
RKKGGKAV*
>k127_510701||0_partial_2
MARCEKQGTNGILQSLMEDVIAIGEMGKITTLQEVYDAVLLGDTVLLMQGDDFALQASTK
GFPSRGVNKAETEVVVQGPKDAFTELMSINVVLTRRRIRDTKLKLKRKKVGTRSKTDVAL

And reran fasta36, along with the other scripts and still threw the same error: awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted

shiraz-shah commented 1 year ago

Hey Haley, Can I see what your lengths file looks like?

haleyhallowell commented 1 year ago

Sure! it looks like this:

k127_510701||0_partial_1        309
k127_510701||0_partial_2        432
k127_510701||0_partial_3        345
k127_510701||0_partial_4        167
k127_510701||0_partial_5        136
k127_510701||0_partial_6        99
k127_510701||0_partial_7        37
k127_510701||0_partial_8        94
k127_510701||0_partial_9        3390
k127_510701||0_partial_10       413
k127_510701||0_partial_11       463
k127_510701||0_partial_12       497
k127_510701||0_partial_13       127
k127_510701||0_partial_14       75
k127_510701||0_partial_15       38
k127_510701||0_partial_16       463
k127_510701||0_partial_17       209
k127_281361||0_partial_1        39
k127_281361||0_partial_2        304
k127_281361||0_partial_3        50
k127_281361||0_partial_4        343
k127_281361||0_partial_5        173
k127_715127||0_partial_1        95
shiraz-shah commented 1 year ago

It looks like it should. Could you maybe send me the fasta file so I can try to reproduce the error? How big is it?

haleyhallowell commented 1 year ago

Sure! Where can I send it? It’s about ~40MB

shiraz-shah commented 1 year ago

Alright, so too large for email. Do you have any suggestions? Have you ever used Google Drive or WeTransfer or DropBox? If you already have a google account, Google Drive is probably the easiest. My google account is shiraz.shah@dbac.dk, if you want to share the file with me privately via Google Drive.

haleyhallowell commented 1 year ago

ah, its actually about 14. ill go ahead and email it to the email above! Thanks!

shiraz-shah commented 1 year ago

Hey Haley, It's because the fasta file has a sequence of length 0 called "k127_1459403||full_27". If you remove that sequence, and redo the fasta step, then I think it should work.

The reason why fasta36 is even able to produce an alignment result for that sequence, is because it automatically takes the next sequence "k127_1459403||full_28" and treats it as if it was "k127_1459403||full_27".

haleyhallowell commented 1 year ago

Hello! I found a couple fasta entries without sequences. I went through and removed those sequences and checked the lengths file to make sure there were no more 0s, then re-ran fasta36 as well as the steps below, and it still gave me the same error. I then went back and found a few fasta sequences with a '*' as its sequence entry and removed those as well to see if that was the issue. I just finished that run and got the same error as well. I double checked that the files were correct, etc

shiraz-shah commented 1 year ago

Could you send me the corrected fasta file? I’d like to try again.

haleyhallowell commented 1 year ago

Sure! Sending to you now.

shiraz-shah commented 1 year ago

Hey Haley, I ran everything with the new file, and it seems to work fine, actually. Here's what I did: fasta36 test_2removed.faa test_2removed.faa -m 8 > test_2removed.fastab cat test_2removed.faa | ./f2s | ./seqlengths > test_2removed.lengths cat test_2removed.fastab | ./joincol test_2removed.lengths | ./joincol test_2removed.lengths 2 | awk '{print $1 "\t" $2 "\t" $11 "\t" $13/$14 "\t" ($8-$7)/(2*$13)+($10-$9)/(2*$14) "\t" ($7+$8-$9-$10)/($13+$14)}' | awk '{if ($3 <= 0.05) print}' | awk '{if ($5 >= 0.4) print}' | awk '{if (sqrt(($4-1)^2) - (sqrt(sqrt($5))-.8) + sqrt($6^2) <= 0.1) print $1 "\t" $2}' > test_2removed.tsv

$ head test_2removed.tsv
k127_510701||0_partial_1    k127_510701||0_partial_1
k127_510701||0_partial_1    k127_655921||full_23
k127_510701||0_partial_1    k127_423438||full_103
k127_510701||0_partial_2    k127_510701||0_partial_2
k127_510701||0_partial_2    k127_156558||full_89
k127_510701||0_partial_2    k127_423438||full_104
k127_510701||0_partial_2    k127_655921||full_22
k127_510701||0_partial_2    k127_214787||0_partial_88
k127_510701||0_partial_2    k127_362384||full_76
k127_510701||0_partial_2    k127_259829||0_partial_130
haleyhallowell commented 1 year ago

got it working! Don't know why it wasn't running from my end. Thanks so much for your time and help!

shiraz-shah commented 12 months ago

No problem. Hope it's useful!