Closed TCLamnidis closed 1 year ago
I can think of two possible solutions/workarounds other than using eigenstrat datasets.
1
or 2
. The downside being that the hash of the package won't match the original package anymore0
in packages. Preferably set to always be 1
or 2
. This would require updating many existing packages, but keep the hashes consistent. OK, I think this is a glitch in smartPCA, since the spec clearly says that the case/control column is only for ... well ... case/control datasets. Annoying! But I'm happy to make it easier for users, so I have nothing against simply adding a flag to forge to make that column non-zero. I don't get your point about the hash. Hashes are computed per package, after the genotype files are made, so there is no issue.
In the mean time, I see no problem of simply employing a one-liner in awk
to change the column in the fam file to make it work for smartPCA.
Update: I cannot get smartpca to work even when using awk
to change the column to all 1s or all 2s. Not sure what the issue is exactly. When doing so, all SNPs are printed as nodata:
but the part listing all individuals as ignored
is not present anymore.
Update: After a lot of investigation, it's clear that smartpca expects the final column of the fam
file to contain the population label. This is important, and requires a change even in the sequence-formats package. I have added It to my todo list. I think we should follow that convention, since Eigensoft has become so standard in the field.
Update: Plink seems to be perfectly happy with the last column carrying something non-numeric (like the population label). However, it's also clear that in various plink outputs, the family ID (first column) can be useful to contain the population label as well. So I guess if we indeed want to switch to fam-files containing the population as first and last column, that will be a bit redundant, but we won't loose information (cause we neither have families nor genotypes in our data).
Interesting insight from @esalmela at the popgen meeting: Using the family ID column in the .fam file for population labels is technically wrong in the first place. What disadvantages would we buy if we removed it there? @AyGhal
Yes, after @esalmela's comments I'm also leaning towards switching to numbering the family IDS just 1,2,3,... and then putting the pop-label in the last column. @AyGhal you said that PLINK overwrites it, but we're not using PLINK to write new PLINK files right? We're just using it for MDS and stuff. I'll try and see what happens if we do it that way.
One more comment/reminder regarding the updating procedure, if we decide to change all packages in the published_data repo: trident update
fully automates the process of updating a Poseidon package, including updating checksums, incrementing the package version and creating/updating the package changelog file. After changing all .fam files with a script, a single, well thought through call to trident update
in the published_data directory should be enough to formally update all packages at once.
I can report that, consistent with @AyGhal's observation of Family IDs being somewhat more respected than the 6th fam column, when computing pairwise distances with --distance-matrix
, the resulting files do contain the family ID information, which is very useful to use the group labels in downstream plotting and things. The 6th column doesn't appear.
So this is indeed the opposite behaviour of what Eigensoft does, which is more or less ignore the Family ID and instead look for the last column.
So I think we can't do this perfectly. I think my original solution of putting the population ID in both columns is still the most practical one.
Alternative: We add a flag for PLINK output that controls that behaviour? Chances are, of course, that people will simply overlook it, but still perhaps that'd be somewhat more transparent then?
Something like --plinkPopLabels
which can assume three states:
FamilyAsPoplabel
PhenotypeAsPoplabel
BothAsPoplabel
(default)how about that?
I think if we decide this, we don't even have to change the repository. We'd just make a note in the documentation that if people wish to use the data with Eigensoft, they need to forge or genoconvert it with the respective flag.
Yes, after @esalmela's comments I'm also leaning towards switching to numbering the family IDS just 1,2,3,... and then putting the pop-label in the last column. @AyGhal you said that PLINK overwrites it, but we're not using PLINK to write new PLINK files right? We're just using it for MDS and stuff. I'll try and see what happens if we do it that way.
We do change it to transposed format to let's say run READ. Or if you do any filtering (for autosomes, inds or SNPs) even if your output format is in PLINK binary you would loose population ID label.
Ah OK. Indeed that's a problem
After rereading this discussion I believe we should indeed add the --plinkPopLabels
flag as suggested by you @stschiff. But this requires this option in sequence-formats
first, right?
Yes, it's a bit more involved change upstream, indeed. I had not followed this up further since then. Maybe I should
I just saw that convertf
also has an explicit cli argument for that behaviour: https://reich.hms.harvard.edu/software/InputFileFormats
outputgroup: This parameter specifies what the 6th column of information about each individual should be in the output. If outputgroup is set to NO (the default), the 6th column will be set to 1 for each Control and 2 for each Case, as specified in the input indiv file. [Individuals specified with some other label, such as a population group label, will be assumed to be controls and the 6th column will be set to 1.] If outputgroup is set to YES, the 6th column will be set to the exact label specified in the input indiv file. [This functionality preserves population group labels.]
Ah interesting! Yes, I should get on that, really
Yes, I always use that parameter. I though at some point we talked about it.
This might be more of an issue the plink formatted packages themselves, instead of poseidon, but I'll put it here for now.
Problem description
The plink fam file has the following columns:
It seems we generally put/allow column 6 to be all 0s in any(/every) package. Some programs, e.g.
smartpca
, will actually look at this column to decide what individuals to include in the analysis. More specifically, a0
is interpreted as missing phenotype, and the individual excluded from the run. Since all individuals are set to 0 (as far as I've seen thus far) the following error is thrown, and the runs killed:After all individuals are set to missing, there is no data left to analyse, and a huge error is printed out, which lists all individuals and SNPs in the dataset!
Expected behaviour
That a user can use
trident forge
andsmartpca
without having to tweak the fam file to correct this.Existing workaround
Reformatting the forged poseidon package to eigenstrat format circumvents this issue.