mennodejong1986 / SambaR

SambaR: Snp datA Management and Basic Analyses in R
MIT License
24 stars 6 forks source link

Return of a problem you solved for someone previously #23

Closed bmillerlab closed 2 years ago

bmillerlab commented 2 years ago

Hello, I am running the new 1.08 version, but had this problem with 1.07 too. I ran vcftools and then plink to get my data in the right format. Then I got this error: WARNING: non-defined allele (NA or 0) present in bim-file. ERROR: encountered unexpected allele in bim file. Alleles should be either 1,2,3,4 or A,C,T,G (character). You can observe the alleles by typing 'myalleles'.

I looked in 'myalleles' and did not see anything odd. I read further in the verbose output (very helpful) and found this:

ERROR: encountered unexpected allele in bim file. Alleles should be either 1,2,3,4 or A,C,T,G (character). You can observe the alleles by typing 'myalleles'.

As mentioned in the SambaR manual, if you convert from vcf to PED/MAP format, the first column of your map file will likely contain zero's only.

to fix try

cut -f2 yourfile.map | cut -f1 -d ':' > mycontigs.txt && cut -f2,3,4 yourfile.map > mymap.txt && paste mycontigs.txt mymap.txt > yourfile.map && rm mycontigs.txt mymap.txt

So, I ran the cut command and then reran plink. I did not discard the 2 output files after the first time, since my problem persisted after this attempt to fix it. I then realized I should be using 1.08 and tried again, but I get the same error (without the information about running "cut" in the output though. I would really appreciate any help you can offer.

Oh, I did try using a genlight object to input the data instead, but when I do that I can't get the populations added to "mygenlight" and I really need them.

I am including the 2 files "mycontigs.txt" and "mymap.txt". I can still see the 0s in "mymap.txt" mycontigs.txt mymap.txt .

mennodejong1986 commented 2 years ago

Hi, You mention that the vector 'myalleles' does not show anything odd, so am I right to assume that you only find either A,C,G,T or 1,2,3,4, and nothing else? You can also visually observe the bim-file. Unlike the map-file, the bim-file lists the alleles (5th and 6th column). Same question: do you see in these columns any values other than 1,2,3,4 or characters other A,C,G,T?

The map-file does not contain these 5th and 6th columns, and instead has 4 columns only, namely:

The mymap.txt file you included (which only has three columns: variant_name, morgan, bp) is the intermediate file. The final map-file needed is 'yourfile.map', generated by the command: paste mycontigs.txt mymap.txt > yourfile.map Therefore, the zero's in the second column of the mymap.txt file are correct: they are the positions in morgans, for which you have no data (and you do not need to).

Regarding the genlight2sambar function, you can define the population assignment by specifying a vector to the popvector flag. For example (assuming you have 100 individuals, of which the first 50 in popA and the second 50 in popB):

pop_vector=c(rep("popA",50),rep("popB",50)) genlight2sambar(genlight_object="mygl",popvector=pop_vector)

Alternatively, you can define the population assignment in the genlight object prior to running the genlight2sambar function. For example: mygl@pop <- pop_vector genlight2sambar(genlight_object="mygl")

Of course, make sure that the order of the population assignment corresponds with the order of your samples in the genlight object (see 'mygenlight@ind.names').

bmillerlab commented 2 years ago

Thank you for the extended suggestions for helping fix the problem.

Yes, 'myalleles' is only showing A, T, G, C. But now I used your instructions and made a yourfile.map and also looked at the .bim file. I see that there are actually 0's in the .bim file!! I am not sure where they come from, so I'm not sure where my original error occured in the prior analysis.

Right now I am working on how to edit the file, since it doesn't open in anything I have that would make it easy to edit. There is a whole column of 0's that do belong (column 3 for morgans) - so I can't just use find-replace. I can't get the file to open in any editing program so far where it would create a real 5th column (or tab) I could edit. I'm really not good with R, but I'm guessing there is a function that would allow me to edit the 5th column only. I'm working on figuring out how to do that - but I don't know what to put in the column instead of 0 (zero) anyway! Since I have ~30,000 snps I can't really just fix the 0's by hand. I could just remove all the snps that have a 0 identified instead of a nucleotide, but that requires me to be able to edit the file too.

I will also just go back and try using genlight again and making a more explicit population file where I can easily count the total number of each population (which I should already know, but I removed a bunch of stuff with a filter before I moved to sambaR since we suddenly have a possible cryptic species issue.

I'll follow up once I get through trying to make the suggested edits/changes from above.

bmillerlab commented 2 years ago

Hello, I forgot to come back to github and let you know that I got it to work. I could not figure out how to fix the bim file so it work right. I did make a pop_vector value and got the genlight2sambar option to work!
Thanks again for your help. Best,