Using unitigs generated from short-reads

WMonteith commented 1 year ago

We've run a GWAS using unitigs generated from whole genomes using pyseer. We would like to repeat the same experiment using unitigs generated from short-read data instead of whole-genome sequences.

The original command used was:

pyseer --wg enet --phenotypes phenotypes.txt --kmers Unitigs.txt --uncompressed --distances phylogeny_distances.tsv --alpha 1 --cpu 30 --output-patterns patterns.txt > selected.txt

We've tried to repeat the experiment using unitigs generated from short-read data and with a modified phenotype file and similarity matrix, but we get the following error:

Could you please help us understand why this isn't working?

BW, Billy

johnlees commented 1 year ago

Do all the sample labels match between the unitig file, the phenotype file and the similarity file? In particular, is 51154_2 present in all of them?

In theory we should support mismatches, and it looks like we might just need to update our pandas command to reindex, but it would be good to understand how this is failing so we can add the correct fix.
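A rough way to check the label question above, as a sketch rather than anything pyseer ships: it assumes the unitig file has lines of the form `SEQUENCE | sample:count sample:count ...`, that the phenotype file is tab-separated with a header and sample names in the first column, and that the similarity file is a tab-separated square matrix with sample names in both the header row and the first column. The function names are made up for illustration.

```python
import csv
import io


def unitig_samples(handle):
    """Collect sample labels from lines like 'ACGT | s1:1 s2:1' (assumed format)."""
    samples = set()
    for line in handle:
        if "|" not in line:
            continue
        _, hits = line.rstrip().split("|", 1)
        for hit in hits.split():
            # drop the trailing ':count' if present
            samples.add(hit.rsplit(":", 1)[0])
    return samples


def table_labels(handle):
    """Return (header labels after the first field, first-column labels)."""
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    return header[1:], [r[0] for r in body]


def label_report(unitigs, phenotypes, distances):
    """Return the labels shared by every input and the labels each input lacks."""
    _, pheno_rows = table_labels(phenotypes)
    dist_cols, dist_rows = table_labels(distances)
    sets = {
        "unitigs": unitig_samples(unitigs),
        "phenotypes": set(pheno_rows),
        "distance rows": set(dist_rows),
        "distance columns": set(dist_cols),
    }
    everything = set.union(*sets.values())
    shared = set.intersection(*sets.values())
    missing = {name: sorted(everything - s) for name, s in sets.items()}
    return shared, missing


if __name__ == "__main__":
    # Toy inputs: 51154_2 is a row of the matrix but not a column.
    unitigs = io.StringIO("ACGT | s1:1 s2:1 51154_2:1\n")
    phenotypes = io.StringIO("samples\tpheno\ns1\t1\ns2\t0\n51154_2\t1\n")
    distances = io.StringIO("\ts1\ts2\ns1\t0\t0.1\ns2\t0.1\t0\n51154_2\t0.2\t0.3\n")
    shared, missing = label_report(unitigs, phenotypes, distances)
    print("shared:", sorted(shared))
    for name, labels in missing.items():
        print(name, "is missing:", labels)
```

With real data, pass open file handles (for example `open("Unitigs.txt")`) instead of the `io.StringIO` objects. Any label listed as missing from only the distance rows or only the distance columns would point to a non-square matrix.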

mgalardini commented 1 year ago

It's actually a bit confusing, because the line just before the one that raises the exception defines the shared labels:

https://github.com/mgalardini/pyseer/blob/master/pyseer/input.py#L98

I guess it could mean that the distance matrix is not symmetrical, i.e. its row and column labels differ? I will add a further intersection on the columns of m to catch these kinds of mistakes.
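To illustrate the idea of intersecting on the columns as well as the rows, here is a minimal pandas sketch. It is not the code in `pyseer/input.py`; the function name `align_square` and the variable `p` (phenotypes indexed by sample) are assumptions, and only `m` follows the naming used in the comment above.

```python
import pandas as pd


def align_square(p, m):
    """Restrict phenotypes p and matrix m to labels found in p, m's rows and m's columns.

    Hypothetical helper: p is a Series indexed by sample label,
    m is a DataFrame that is supposed to be square.
    """
    shared = p.index.intersection(m.index).intersection(m.columns)
    if shared.empty:
        raise ValueError("no sample labels shared between phenotypes and matrix")
    # Selecting with labels absent from either axis would raise a KeyError,
    # so only the doubly-intersected labels are used here.
    return p.loc[shared], m.loc[shared, shared]


if __name__ == "__main__":
    p = pd.Series([1, 0, 1], index=["s1", "s2", "51154_2"])
    # 51154_2 appears as a row but not as a column
    m = pd.DataFrame(
        [[0.0, 0.1], [0.1, 0.0], [0.2, 0.3]],
        index=["s1", "s2", "51154_2"],
        columns=["s1", "s2"],
    )
    p_sub, m_sub = align_square(p, m)
    print(m_sub)
```

Intersecting only `p.index` with `m.index` would keep `51154_2` in this toy case, and a later `m.loc[labels, labels]` lookup would then fail on the column axis, which is one way a non-square matrix could surface as an exception.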

mgalardini / pyseer

Using unitigs generated from short-reads #223