nonstandard chromosome IDs

stevemussmann / admixturePipeline

A pipeline that accepts a VCF file to run through Admixture

GNU General Public License v3.0

nonstandard chromosome IDs #20

Open RvV1979 opened 2 hours ago

RvV1979 commented 2 hours ago

Hi Steve,

I am analyzing a dataset with nonstandard chromosome IDs (specifically, GenBank accession codes) and therefore get the following error: Error: Invalid chromosome code 'NC_XXXX' on line 1 of .bim file. In my own pipelines, I add --allow-extra-chr to my Plink commands to solve this.

Is there a way to have the admixturePipeline work with such nonstandard chromosome IDs?

Thanks

stevemussmann commented 2 hours ago

Good morning,

What kind of inputs are you using? If you are starting from a VCF, then the pipeline should be adding --allow-extra-chr to the Plink call automatically. If you are starting from a Plink-format file, then that might not yet be the case, since that feature is still relatively new.

If you're starting from a plink file then a work-around might be running your file through Plink with the --allow-extra-chr option before providing it to admixpipe.

-Steve

RvV1979 commented 1 hour ago

Ah yes, sorry that was not clear. I was using a PLINK bed file as input. This was already generated by Plink with the --allow-extra-chr option so that does not solve the issue. In my experience, that option needs to be used every time Plink is used on any file with nonstandard chromosome IDs.
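To illustrate the point about passing the option on every call, here is a minimal Python sketch of a wrapper that always inserts --allow-extra-chr into a Plink 1.9 command line. The function names plink_command and run_plink, and the file prefixes, are hypothetical; this is not code from admixturePipeline.

```python
import shutil
import subprocess


def plink_command(bfile, out, extra_args=()):
    """Build a Plink argument list that always includes --allow-extra-chr."""
    cmd = ["plink", "--bfile", bfile, "--allow-extra-chr"]
    cmd.extend(extra_args)
    cmd.extend(["--out", out])
    return cmd


def run_plink(bfile, out, extra_args=()):
    """Run Plink with the command built above; fail early if it is not on PATH."""
    if shutil.which("plink") is None:
        raise FileNotFoundError("plink executable not found on PATH")
    return subprocess.run(plink_command(bfile, out, extra_args), check=True)


if __name__ == "__main__":
    # "mydata" and "mydata_vcf" are placeholder prefixes, not real files.
    print(" ".join(plink_command("mydata", "mydata_vcf", ["--recode", "vcf-iid"])))
```

Building the argument list in one place means no individual call can forget the flag, which is the failure mode described above.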

To try the VCF route, I exported the bed file to vcf-iid format with Plink and used that as input. Then, vcf-query outputs a list of individuals, and vcftools outputs a table of individual missingness (this is very low: I have filtered for that in earlier steps). However, for some reason, the next vcftools step removes all individuals through many --remove-indv arguments. In addition, for each chromosome, I get Unrecognized values used for CHROM: NC_XXXX - Replacing with 0. So it seems that nonstandard chromosome IDs are not handled appropriately even before Plink is called.

If you have some more ideas I would be much obliged.

Thanks

stevemussmann commented 10 minutes ago

Thanks for the additional details. It is my understanding that Admixture does not utilize chromosome information (both the Alexander et al. 2009 paper and the Admixture manual state that linkage equilibrium is assumed, so datasets should be filtered for LD prior to running Admixture), and my recollection is that the program itself is very restrictive in what it will allow as chromosome names. Comments on the code at this website seem to match my memory (https://speciationgenomics.github.io/ADMIXTURE/). Consequently, I do not retain this information in the Plink conversion because it is counterproductive to running Admixture. If you are receiving those warnings but Admixture itself is running, then the pipeline is working as intended.
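For a Plink fileset, discarding the chromosome information could look roughly like the sketch below, which overwrites the first column of a .bim file with 0, mirroring the "Replacing with 0" behaviour in the vcftools warning. It assumes a standard whitespace-delimited .bim with the chromosome code in column 1; the function name zero_bim_chromosomes is hypothetical and this is not the pipeline's own code.

```python
def zero_bim_chromosomes(bim_in, bim_out):
    """Copy a .bim file, replacing the chromosome code (column 1) with "0".

    Returns the number of variant lines written.
    """
    written = 0
    with open(bim_in) as src, open(bim_out, "w") as dst:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            fields[0] = "0"
            dst.write("\t".join(fields) + "\n")
            written += 1
    return written
```

Writing to a separate output file and only then replacing the original keeps the untouched .bim available if something goes wrong.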

The --remove-indv calls in vcftools might be happening if some individuals are present in your vcf file that are not present in your population map file.
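One way to check this is to compare the sample names in the VCF header against the first column of the population map. The sketch below assumes the population map is a whitespace-delimited text file with the sample name in the first column and the population in the second; the function names are hypothetical.

```python
import gzip


def vcf_samples(vcf_path):
    """Return sample names from the #CHROM header line of a VCF (plain or .gz)."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields; samples follow.
                return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found in " + vcf_path)


def popmap_samples(popmap_path):
    """Return the set of sample names in column 1 of the population map."""
    names = set()
    with open(popmap_path) as fh:
        for line in fh:
            fields = line.split()
            if fields:
                names.add(fields[0])
    return names


def missing_from_popmap(vcf_path, popmap_path):
    """List VCF samples (in header order) that are absent from the population map."""
    known = popmap_samples(popmap_path)
    return [s for s in vcf_samples(vcf_path) if s not in known]
```

If missing_from_popmap returns every sample, the names in the two files probably differ systematically (for example in a prefix or in capitalization) rather than a few individuals being absent.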