veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

Issue with CleanStopCodons.bf: Incompatible matrix dimensions in call to CheckDimension: 1x64 and 8000x1 #1590

Closed mbarkdull closed 1 year ago

mbarkdull commented 1 year ago

Good afternoon,

I am trying to remove stop codons from a few thousand input files, using essentially the following code: hyphy /programs/hyphy-2.5.49/share/hyphy/TemplateBatchFiles/CleanStopCodons.bf Universal ./removedStops/cleaned_OG0014256_cdsSequences.fasta No/Yes test.fasta.

CleanStopCodons.bf works on the majority of inputs, but for a minority, it fails with the error:

Universal

Data Read:
5 species:{CSM3677_CVAR_11742_RA_p1, CSM3685_CVAR_11742_RA_p1, CVAR_CVAR_11742_RA_p1, POW0123_CVAR_11742_RA_p1, POW0461_CVAR_11742_RA_p1};
Total Sites:1980;
Distinct Sites:8No/Yes
Error:
Incompatible matrix dimensions in call to CheckDimension: 1x64 and 8000x1

    While computing: stopCodonTemplate*siteInfo

Function call stack
1 :  siteInfo1=stopCodonTemplate*siteInfo;
    Standard input redirect:
        000000000004 : test.fasta

-------

I'm attaching a sample input so you can replicate the error.

If you could help me understand what the issue is, I would really appreciate it. If there is no way to run CleanStopCodons.bf on these inputs, that's fine, but I want to make sure I'm not making an obvious mistake.

Thank you so much!
cleaned_OG0014256_cdsSequences.txt

jzehr commented 1 year ago

Dear @mbarkdull,

Looking at the sample input you attached here, 4 of the 5 sequences in the fasta file are identical and a large percentage of each sequence is N. Those factors may be throwing off the script. Do you see similar patterns in the other input files that fail? If so, you may want to add a filter step to remove files where a sequence consists of more than 50% Ns (or some other appropriate cut-off).
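A minimal sketch of such a filter step, assuming plain FASTA input and a per-sequence cut-off. The function names and the choice to ignore gap characters ("-" and "?") when computing the fraction are illustrative assumptions, not part of HyPhy:

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def n_fraction(sequence):
    """Fraction of non-gap characters that are N (0.0 for an empty sequence)."""
    residues = [c for c in sequence.upper() if c not in "-?"]
    if not residues:
        return 0.0
    return residues.count("N") / len(residues)


def exceeds_n_cutoff(fasta_text, cutoff=0.5):
    """True if any sequence in the file is more than `cutoff` N."""
    return any(n_fraction(seq) > cutoff for _, seq in read_fasta(fasta_text))
```

Calling `exceeds_n_cutoff(open(path).read())` on each input before running CleanStopCodons.bf would flag the files to set aside; the 0.5 default mirrors the 50% suggestion above and can be tuned.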

@spond may have a more technical explanation, but that is where I would start.

Best,

spond commented 1 year ago

Dear @mbarkdull,

HyPhy incorrectly auto-detects this file as containing amino acids (since N is a valid amino-acid code). Unfortunately, there are no "smooth" workarounds. See https://github.com/veg/hyphy/issues/1574
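One way to read the dimensions in the error message, consistent with this explanation (an inference from the numbers, not something confirmed in the thread): a triplet over the 4-letter nucleotide alphabet has 64 states, while a triplet over the 20-letter amino-acid alphabet has 8000. A short check of that arithmetic:

```python
NUCLEOTIDES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def triplet_states(alphabet):
    """Number of distinct length-3 words over the given alphabet."""
    return len(alphabet) ** 3


print(triplet_states(NUCLEOTIDES))  # 64
print(triplet_states(AMINO_ACIDS))  # 8000
print("N" in AMINO_ACIDS)           # True: N is asparagine in the protein alphabet
```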

One option (ugly, but should work) is to just add the following text at the very top (line 1) of the offending file.

BASESET :"ACGT"
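For a few thousand files, prepending that line by hand is impractical. A hedged sketch of doing it in a script follows; the header string is copied verbatim from the line above and passed as a parameter so it can be adjusted if HyPhy expects a slightly different spelling (the linked issue is the place to check). The function name is hypothetical:

```python
BASESET_LINE = 'BASESET :"ACGT"'  # copied verbatim from the comment above


def with_baseset_header(fasta_text, header=BASESET_LINE):
    """Return the file contents with `header` as the first line.

    If the first line is already the header, the text is returned unchanged,
    so running the script twice does not stack headers.
    """
    lines = fasta_text.splitlines()
    if lines and lines[0].strip() == header:
        return fasta_text
    return header + "\n" + fasta_text
```

Writing the result to a new file rather than overwriting the input keeps the original FASTA usable by other tools, which may not understand the extra first line.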

Another option is to strip out N prior to calling HyPhy.
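One possible reading of "strip out N", sketched under assumptions: deleting N characters outright would change sequence lengths and shift the reading frame, so this version replaces each N with a gap character instead. Both the choice of "-" and the assumption that HyPhy then detects the file as nucleotide data are untested here and should be verified on a failing input:

```python
def mask_n(fasta_text, replacement="-"):
    """Replace N/n in sequence lines with `replacement`; header lines are untouched."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Sequence names may legitimately contain the letter N.
            out.append(line)
        else:
            out.append(line.replace("N", replacement).replace("n", replacement))
    return "\n".join(out) + "\n" if out else ""
```

Because the replacement is one character for one character, every sequence keeps its original length and codon positions.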

Best, Sergei

cleaned_OG0014256_cdsSequences.txt

mbarkdull commented 1 year ago

Dear @jzehr and @spond,

Thank you so much for your quick replies. It does appear that all of the failures are caused by a high proportion of Ns in the sequences, so I'll explore those workarounds.

Very best, Megan