poseidon-framework / poseidon-hs

A toolset to work with modular genotype databases in the Poseidon format
https://poseidon-framework.github.io/#/trident
MIT License
7 stars 2 forks source link

Plink format data Phenotype column set to missing for all data #128

Closed TCLamnidis closed 1 year ago

TCLamnidis commented 3 years ago

This might be more of an issue the plink formatted packages themselves, instead of poseidon, but I'll put it here for now.

Problem description

The plink fam file has the following columns:

1: Family ID ('FID') 2: Within-family ID ('IID'; cannot be '0') 3: Within-family ID of father ('0' if father isn't in dataset) 4: Within-family ID of mother ('0' if mother isn't in dataset) 5: Sex code ('1' = male, '2' = female, '0' = unknown) 6: Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

It seems we generally put/allow column 6 to be all 0s in any(/every) package. Some programs, e.g. smartpca, will actually look at this column to decide what individuals to include in the analysis. More specifically, a 0 is interpreted as missing phenotype, and the individual excluded from the run. Since all individuals are set to 0 (as far as I've seen thus far) the following error is thrown, and the runs killed:

parameter file: smartpca_runs/poplist1/parameters.txt
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
genotypename: forged_package/PCA_package_1.bed    
snpname: forged_package/PCA_package_1.bim        
indivname: forged_package/PCA_package_1.fam        
evecoutname: smartpca_runs/poplist1/PCA_poplist1_plink.evec           
evaloutname: smartpca_runs/poplist1/PCA_poplist1_plink.eval           
poplistname: PCA_poplists/PCA_poplist1.txt
lsqproject: YES     
outliermode: 2      
numoutevec: 4       
## smartpca version: 16000
norm used

lsqproject used
French:HGDP00511 ignored
French:HGDP00512 ignored
French:HGDP00513 ignored
[...]
Adygei:YY-18 ignored
Adygei:YY-36 ignored
all individuals set ignore.  Likely input problem (col 6)
resetting all individual...
genotype file processed
nodata:            rs3094315
nodata:            rs7419119
[...]
nodata:          rs116026001
nodata:        Affx-50065629
number of samples used: 0 number of snps used: 0
Using 1 thread, and partial sum lookup algorithm.
total number of snps killed in pass: 0  used: 0
fatalx:
XTX has zero trace (perhaps no data)
Abort trap: 6

After all individuals are set to missing, there is no data left to analyse, and a huge error is printed out, which lists all individuals and SNPs in the dataset!

Expected behaviour

That a user can use trident forge and smartpca without having to tweak the fam file to correct this.

Existing workaround

Reformatting the forged poseidon package to eigenstrat format circumvents this issue.

TCLamnidis commented 3 years ago

I can think of two possible solutions/workarounds other than using eigenstrat datasets.

  1. Adding a flag to poseidon forge that forces the sixth column of the fam file to either 1 or 2. The downside being that the hash of the package won't match the original package anymore
  2. Enforcing that the sixth column of the fam file is NOT 0 in packages. Preferably set to always be 1 or 2. This would require updating many existing packages, but keep the hashes consistent.
stschiff commented 3 years ago

OK, I think this is a glitch in smartPCA, since the spec clearly says that the case/control column is only for ... well ... case/control datasets. Annoying! But I'm happy to make it easier for users, so I have nothing against simply adding a flag to forge to make that column non-zero. I don't get your point about the hash. Hashes are computed per package, after the genotype files are made, so there is no issue.

In the mean time, I see no problem of simply employing a one-liner in awk to change the column in the fam file to make it work for smartPCA.

TCLamnidis commented 3 years ago

Update: I cannot get smartpca to work even when using awk to change the column to all 1s or all 2s. Not sure what the issue is exactly. When doing so, all SNPs are printed as nodata: but the part listing all individuals as ignored is not present anymore.

stschiff commented 3 years ago

Update: After a lot of investigation, it's clear that smartpca expects the final column of the fam file to contain the population label. This is important, and requires a change even in the sequence-formats package. I have added It to my todo list. I think we should follow that convention, since Eigensoft has become so standard in the field.

stschiff commented 3 years ago

Update: Plink seems to be perfectly happy with the last column carrying something non-numeric (like the population label). However, it's also clear that in various plink outputs, the family ID (first column) can be useful to contain the population label as well. So I guess if we indeed want to switch to fam-files containing the population as first and last column, that will be a bit redundant, but we won't loose information (cause we neither have families nor genotypes in our data).

nevrome commented 3 years ago

Interesting insight from @esalmela at the popgen meeting: Using the family ID column in the .fam file for population labels is technically wrong in the first place. What disadvantages would we buy if we removed it there? @AyGhal

stschiff commented 3 years ago

Yes, after @esalmela's comments I'm also leaning towards switching to numbering the family IDS just 1,2,3,... and then putting the pop-label in the last column. @AyGhal you said that PLINK overwrites it, but we're not using PLINK to write new PLINK files right? We're just using it for MDS and stuff. I'll try and see what happens if we do it that way.

nevrome commented 3 years ago

One more comment/reminder regarding the updating procedure, if we decide to change all packages in the published_data repo: trident update fully automates the process of updating a Poseidon package, including updating checksums, incrementing the package version and creating/updating the package changelog file. After changing all .fam files with a script, a single, well thought through call to trident update in the published_data directory should be enough to formally update all packages at once.

stschiff commented 3 years ago

I can report that, consistent with @AyGhal's observation of Family IDs being somewhat more respected than the 6th fam column, when computing pairwise distances with --distance-matrix, the resulting files do contain the family ID information, which is very useful to use the group labels in downstream plotting and things. The 6th column doesn't appear.

So this is indeed the opposite behaviour of what Eigensoft does, which is more or less ignore the Family ID and instead look for the last column.

So I think we can't do this perfectly. I think my original solution of putting the population ID in both columns is still the most practical one.

Alternative: We add a flag for PLINK output that controls that behaviour? Chances are, of course, that people will simply overlook it, but still perhaps that'd be somewhat more transparent then?

Something like --plinkPopLabels which can assume three states:

how about that?

I think if we decide this, we don't even have to change the repository. We'd just make a note in the documentation that if people wish to use the data with Eigensoft, they need to forge or genoconvert it with the respective flag.

AyGhal commented 3 years ago

Yes, after @esalmela's comments I'm also leaning towards switching to numbering the family IDS just 1,2,3,... and then putting the pop-label in the last column. @AyGhal you said that PLINK overwrites it, but we're not using PLINK to write new PLINK files right? We're just using it for MDS and stuff. I'll try and see what happens if we do it that way.

We do change it to transposed format to let's say run READ. Or if you do any filtering (for autosomes, inds or SNPs) even if your output format is in PLINK binary you would loose population ID label.

stschiff commented 3 years ago

Ah OK. Indeed that's a problem

nevrome commented 1 year ago

After rereading this discussion I believe we should indeed add the --plinkPopLabels flag as suggested by you @stschiff. But this requires this option in sequence-formats first, right?

stschiff commented 1 year ago

Yes, it's a bit more involved change upstream, indeed. I had not followed this up further since then. Maybe I should

nevrome commented 1 year ago

I just saw that convertf also has an explicit cli argument for that behaviour: https://reich.hms.harvard.edu/software/InputFileFormats

image

outputgroup: This parameter specifies what the 6th column of information about each individual should be in the output. If outputgroup is set to NO (the default), the 6th column will be set to 1 for each Control and 2 for each Case, as specified in the input indiv file. [Individuals specified with some other label, such as a population group label, will be assumed to be controls and the 6th column will be set to 1.] If outputgroup is set to YES, the 6th column will be set to the exact label specified in the input indiv file. [This functionality preserves population group labels.]

stschiff commented 1 year ago

Ah interesting! Yes, I should get on that, really

AyGhal commented 1 year ago

Yes, I always use that parameter. I though at some point we talked about it.