gtonkinhill / fastbaps

A fast approximation to a Dirichlet Process Mixture model (DPM) for clustering genetic data
MIT License
58 stars 8 forks source link

Errors with small alignments as input #20

Closed johnlees closed 2 years ago

johnlees commented 2 years ago

Running this as part of a pipeline, we end up with a tiny alignment:

>17059_6#10
AGCAC
>17059_6#22
AGCTG
>15995_1#182
AGCTG
>15995_1#150
TATTC
>15995_1#114
AGCTC
>17059_6#81
AGCTC

If I save this as small_fb.aln and import with sparse.data <- import_fasta_sparse_nt("~/Downloads/small_fb.aln") I get the following error:

Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions

backtrace is

6: stop("'x' must be an array of at least two dimensions")
5: base::colSums(x, na.rm = na.rm, dims = dims, ...)
4: colSums(snp.matrix == 1)
3: colSums(snp.matrix == 1)
2: matrix(c(rep(nrow(snp.matrix), ncol(snp.matrix)), colSums(snp.matrix == 
       1), colSums(snp.matrix == 2), colSums(snp.matrix == 3), colSums(snp.matrix == 
       4)), nrow = 5, byrow = TRUE)
1: import_fasta_sparse_nt("~/Downloads/small_fb.aln")

which I think is from here: https://github.com/gtonkinhill/fastbaps/blob/master/R/import_fasta_sparse_nt.R#L87-L91

At first I thought this was because we only have one non-singleton column, but adding more sites didn't immediately help. I haven't read through the code properly, but I am guessing this is because there needs to be at least one observation of the four bases in passing sites. When I get to every base being represented in non-singleton sites it works ok:

>17059_6#10
AGCACACTG
>17059_6#22
AGCTGACTG
>15995_1#182
AGCTGTAAG
>15995_1#150
TATTCTAAG
>15995_1#114
AGCTCTAAA
>17059_6#81
AGCTCTAAA

For these cases, is it still possible to set the prior and keep running? If not, would it be possible to give a more informative error message?

(Sorry this is a bit of an edge case, it's just hard to work around from the error message we currently get)

gtonkinhill commented 2 years ago

By default fastbaps excludes SNPs that only appear in a single genome https://github.com/gtonkinhill/fastbaps/blob/master/R/import_fasta_sparse_nt.R#L80-L83.

I think this error is being thrown when there is <= 1 remaining SNP that is shared by more than one genome. The code works if you duplicate the SNP in your first example

>17059_6#10
AGCACC
>17059_6#22
AGCTGG
>15995_1#182
AGCTGG
>15995_1#150
TATTCC
>15995_1#114
AGCTCC
>17059_6#81
AGCTCC

It it probably best not to continue with the algorithm if only a single SNP remains. Is there a particular error message that would work better for your pipeline?

johnlees commented 2 years ago

I think this error is being thrown when there is <= 1 remaining SNP that is shared by more than one genome. The code works if you duplicate the SNP in your first example

I thought that at first too, but the following also errors:

>17059_6#10
AGCACAAC
>17059_6#22
AGCTGAAC
>15995_1#182
AGCTGTTG
>15995_1#150
TATTCTTG
>15995_1#114
AGCTCTTG
>17059_6#81
AGCTCTTG

So I think it's something slightly different

It it probably best not to continue with the algorithm if only a single SNP remains. Is there a particular error message that would work better for your pipeline?

Obviously sensible. Just anything along the lines of 'not enough variants' would do!

gtonkinhill commented 2 years ago

It looks like there was a bug when loading from fasta files rather than DNAbin objects.

The initial sequence was not counted in the definition of the consensus variant. This led to problems in very small alignments when a variant was only present twice and one of these examples was in the first sequence.

I've pushed a fix for this and added a better error message. Once the automatic checks have passed I'll create a new release.

Thanks for pointing this out!

johnlees commented 2 years ago

awesome, thanks for the quick fix!