ERROR: duplicate sample names found in input data (PED file)

HanXiaoEvo commented 2 years ago

Hi Menno,

I am having trouble loading my input file, which I generated from vcf via bed/map to raw/bim. Here is the error:

Reading PLINK raw format into a genlight object...

Reading loci information...

Reading and converting genotypes... . Building final object...

...done.

Creating snps dataframe...

WARNING: input raw/ped file contains duplicated SNP names. Adding numbers to make them unique, in order to avoid errors downstream.

Removing ':+' and ':-' from SNP names...

snps$minor not factor class.

snps$major not factor class.

Creating inds dataframe...

ERROR: duplicate sample names found in input data (PED file). A list of the duplicated sample name(s) is saved in the vector 'myduplicates'. Makes changes to filenames in the PED file (second column), convert to raw, and afterwards try running the importdata() function again.

myduplicates
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "1" "2" "3" "4" "5" "6" "7" "8" "9"
[20] "10" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "18" "19" "20" "1"
[39] "2" "3" "4" "5" "6" "7" "8" "9" "10" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21"
[58] "22" "23" "24" "25" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "1" "2" "3"
[77] "4" "5" "6" "7" "8" "9" "10" "11" "12" "005" "006" "008" "009" "010" "011" "012" "013" "014" "016"
[96] "001" "002" "003" "004" "005" "006" "014" "015" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
[115] "12" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "1" "2" "3" "4" "5" "6"
[134] "7" "8" "9" "10" "11" "12" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "002"
[153] "003" "004" "005" "006" "007" "008" "009" "010" "011" "012" "013" "014" "015" "016" "017"

I am not sure how I should deal with this, because I do have the same numbers for samples from different pops, like XX_01, YY_01, etc. I generated the files exactly as you suggested; the only difference is that I added --allow-extra-chr when making the ped file. Thank you very much and I look forward to your suggestions!

Best regards, Han

mennodejong1986 commented 2 years ago

Hi Han, SambaR does not accept duplicate sample names. Samples from different populations named XX_01, YY_01, etc. would be fine, but it is not allowed for population XX and population YY to both contain a sample called 01, which seems to be the case here. This is easy to fix, though. If the data set is not too big, the easiest way is to open the ped file in Excel and edit the second column (which contains the sample names). If the ped file cannot be opened in Excel, you will need a command you can run on the command line (for example using sed or awk) to edit the second column of the ped file. Afterwards, convert to raw/bim and then rerun the importdata function. Hope that helps! Menno
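
For anyone who would rather script the edit than use Excel, sed, or awk, here is a minimal Python sketch of the same idea. It is not part of SambaR. It assumes that the first column of the ped file (the family ID) holds the population label, which may not be true for your data, and it simply prefixes the second column (the sample name) with that label. The function names and the file names in the comment are made up for illustration.

```python
from collections import Counter


def find_duplicates(ped_lines):
    """Return the sample names (second column) that occur more than once."""
    counts = Counter(line.split(None, 2)[1] for line in ped_lines if line.strip())
    return sorted(name for name, n in counts.items() if n > 1)


def prefix_sample_ids(ped_lines, sep="_"):
    """Yield ped lines with the sample name prefixed by the first column.

    Assumption: the first column identifies the population (e.g. XX, YY).
    Only the first two fields are split off, so the genotype columns are
    passed through untouched. The first two delimiters are rewritten as
    single spaces.
    """
    for line in ped_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fid, iid, rest = line.split(None, 2)
        yield f"{fid} {fid}{sep}{iid} {rest}"


def rewrite_ped(in_path, out_path):
    """Write a copy of the ped file at in_path with prefixed sample names."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for fixed_line in prefix_sample_ids(src):
            dst.write(fixed_line + "\n")


# Toy input: two populations that both contain a sample called 01.
example = [
    "XX 01 0 0 0 -9 A A\n",
    "YY 01 0 0 0 -9 A G\n",
]
fixed = list(prefix_sample_ids(example))

# For a real file (hypothetical names): rewrite_ped("mydata.ped", "mydata_fixed.ped")
```

Running find_duplicates on the rewritten lines is a quick way to confirm that no duplicates remain before converting to raw/bim and rerunning importdata. If the first column in your ped file does not distinguish the populations, the prefix would have to come from somewhere else, such as a separate sample-to-population table.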

HanXiaoEvo commented 2 years ago

Hi Menno,

Thanks for your super quick reply! I will give it a try later today.

Best, Han

mennodejong1986 / SambaR

ERROR: duplicate sample names found in input data (PED file) #19