voichek / kmersGWAS

A library for running k-mers based GWAS
GNU General Public License v3.0
105 stars 24 forks source link

Confusing PAV table output - vcf #128

Closed renhamm closed 1 year ago

renhamm commented 1 year ago

When I convert my PAV table to vcf format, I end with three options instead of two: "1/1", "0/0", and "./." The ./. usually means we dont have coverage in this area when looking at normal SNPs, but as our loci are actually unmapped kmers, I'm not sure how the "./." differs from "0/0"? Is this indicative of an upstream error?

voichek commented 1 year ago

Hi Lauren,

Just to make sure I understand you correctly - you have a k-mers-table which you converted to a plink binary format using the kmers_table_to_bed functionality, then you took the plink and converted it to a VCF and got some k-mers in some accessions with the value "./." which mean a missing value. Is that correct? If this is the case it shouldn't happened.

Yoav

renhamm commented 1 year ago

Yes. Here is the commands I used if that helps:

create the kmer PAV table

/PATH/build_kmers_table -l kmers_list_paths.txt -k 31 -a kmers_to_use -o kmers_table

convert table to PLINK binary format

/PATH/kmers_table_to_bed -t kmers_table -k 31 -b 90000000000 -p phenotypes.pheno --maf 0.05 --mac 1 -b 10000000 -o PAVtable_PLINK\

to merge all the bed sub-files

plink --bfile /PATH/PAVtable_PLINK.0 --merge-list all_my_files.txt --make-bed --out PAVmerged

convert to vcf format

/PATH/plink --bfile /PATH/PAVmerged --recode vcf --out PAVmerged

voichek commented 1 year ago

Hi Lauren,

Thank you.

  1. Why do you set -b twice in kmers_table_to_bed? it shouldn't matter, but I am surprised it didn't raise an error.
  2. To debug this I will probably need to look at the kmers_table, I know it is probably a big file, but do you think you can share it with me?
  3. In the meanwhile can you upload here a few lines from the vcf file with k-mers that have the ./.? and also the textual presence absence pattern of the same k-mers as outputed with the filter_kmers functionality ? maybe by comparing the two I can understand what happened.
  4. Did you run the above commands on the same machine/computer? Can you tell me which CPU it had (you can use the cat /proc/cpuinfo output).
renhamm commented 1 year ago
  1. I must have sent that as a typo, I will fix that in my notes. Likely my actual script doesn't include the repetition, thus the lack of errors
  2. I'm working on transferring that file now and will upload it shortly.
  3. I've copied a few lines from two of the resulting vcfs below:
  4. Partially, all but one command (including those to create the PAV table) were run on the same remote supercomputing cluster - Savio at UC Berkeley. It uses a Intel Xeon E5-2670 v2, 2.50 GHz CPU. The final vcf step was too intensive so it was run on an xl_mem node with a Intel Xeon Skylake 6130 @ 2.1 GHz CPU.

- The pipeline outputs thousands of smaller vcf files, of which some look right and some do not. VCFs #000-099 look perfect, but all VCFs with a number greater than or equal to 100 include the weird ./. missing value option. From a good sub-vcf: 0 0 AAAAAAAAAAAAAAAAAGAAAAAAAAAGCTA N 1 . . PR GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 1/1 0/0 1/1 0/0 1/1 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 1/1 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 1/1 0/0 1/1 1/1 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 1/1 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 1/1 1/1 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 1/1 0/0 1/1 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 1/1 1/1 0/0 0/0 0/0 1/1 0/0 From a bad sub-vcf: 0 0 AAGACTGGACAATCACTTACTAGCACGCTCC 1 . . . PR GT ./. 0/0 0/0 0/0 ./. ./. 0/0 ./. 0/0 ./. ./. 0/0 ./. ./. 0/0 ./. 0/0 0/0 0/0 ./. 0/0 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 ./. 0/0 ./. ./. 0/0 0/0 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 0/0 0/0 ./. 0/0 ./. 0/0 ./. ./. ./. 0/0 ./. 0/0 ./. 0/0 ./. ./. ./. ./. 0/0 ./. 0/0 ./. ./. ./. 0/0 0/0 0/0 ./. ./. ./. ./. ./. 0/0 0/0 ./. ./. 0/0 ./. ./. ./. ./. 0/0 0/0 0/0 0/0 ./. ./. ./. ./. ./. ./. ./. 0/0 ./. ./. ./. 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 ./. ./. ./. ./. ./. ./. 0/0 ./. 0/0 0/0 0/0 0/0 ./. ./. 0/0 0/0 ./. ./. 0/0 ./. ./. ./. ./. 0/0 0/0 0/0 ./. ./. ./. ./. ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/0 ./. ./. ./. ./. ./. 0/0 ./. 0/0 0/0 0/0 ./. ./. ./. ./. ./. ./. ./. 0/0 0/0 ./. ./. ./. 0/0 ./. ./. ./. ./. ./. ./. ./. ./. 0/0 ./. 0/0 0/0 0/0 0/0 ./. ./. 0/0 0/0 ./. 0/0 0/0 0/0 0/0 ./. ./. ./. ./. 0/0 ./. ./. 0/0 ./. 0/0 0/0 ./. 0/0 ./. ./. ./. 0/0 0/0 0/0 ./. 0/0 0/0 0/0 ./. ./. ./. 0/0 0/0 ./. ./. 0/0 0/0 ./. 0/0 0/0 0/0 0/0 ./. ./. ./. 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. ./. ./. ./. ./. ./. 0/0 ./. 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 ./. ./. ./. 0/0 ./. 0/0 0/0

voichek commented 1 year ago

Dear Lauren,

Thank you. I tried to cut also my A. thaliana k-mers-table to many plink binary file and it didn't show your bug.

  1. Regarding my previous point (3) can you show the output of filter_kmers for the two k-mers you outputted above? I would expect the the ./. would stand for "1/1" but there is some glitch in the binary format encoding for some reason. Also the two fields after the k-mer change (N & 1 --> 1 & ".") which also indicate some bit glitch.
  2. Do you see the "./." also when you convert only one of the sub-file outputted from the kmers_table_to_bed ? Or do you see this only after you merge?
  3. How many k-mers do you have in your k-mers table? Maybe it is larger than my test set and this is why I didn't notice it before.
  4. Regarding point (1) from before, I checked and if you define "-b" twice it will not fail, and might take the second value. As you have so many sub-files I guess it took the second definition in your command run. If you set -b to a high number, I would go for 1,000,000,000, not to pass the 32 number, do you still have the issue with the "./.".
  5. Also regarding your line when running the kmeres_table_to_bed, you have "\" after the output directory, which shouldn't be there. I don't think it can make any issue like this, but just to make everything clean.

Yoav

renhamm commented 1 year ago

I'm currently running the requested tasks above, but the kmers_table.table file even when compressed is much to large to upload (47799 MB). Is there an email that I can send it to or share it with?

voichek commented 1 year ago

If you have a link to a drive or ftp you can send it to me to yoav.voichek [at] gmi.oeaw.ac.at.

btw, I have thought about it since and I am most curious about (2) from before. Did you see the "./." when you converted single sub-files or only after merging? because it you don't see it in single files it is an issue with PLINK. I suspect this might be the issue as there are the extra fields which are different (point (1)) and I don't see how I could have changed them in my files.

voichek commented 1 year ago

Hi Lauren,

I am still very curious about what happened here. However, as I see no activity here, should we close it for now?

Best, Yoav

renhamm commented 1 year ago

Don't close! I've been processing things slower for a combination of medical reasons and server failures on campus. I'm hoping to get back to you very very shortly.

On Tue, Feb 7, 2023 at 12:22 AM Yoav Voichek @.***> wrote:

Hi Lauren,

I am still very curious about what happened here. However, as I see no activity here, should we close it for now?

Best, Yoav

— Reply to this email directly, view it on GitHub https://github.com/voichek/kmersGWAS/issues/128#issuecomment-1420373120, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO2GFKUJ6GKAMWVAJVP57V3WWIA3NANCNFSM6AAAAAATZYXXKQ . You are receiving this because you authored the thread.Message ID: @.***>

voichek commented 1 year ago

Dear Lauren,

No worries. Have a swift recovery.

I just make sure that there are no "orphan" issues left behind. In this case, take your time :)

Best, Yoav

renhamm commented 1 year ago

I am currently rerunning the PLINK bed-to-vcf conversion individually to see if this fixes it. When I ran one of the bed files which resulted in the abberant "./." option, it ended up looking fine. This makes me think that there was some error PLINK had in looping through all of the files. I can't guarantee that this will fix it, but it's worth a shot while I wait for our internal data transfer servers to come back online. Hopefully this helps!

renhamm commented 1 year ago

It looks like that fixed it! I think there was a misinstallation of an older version of plink (1.07 instead of 1.9) that led to an inappropriate conversion of the "1/1" to "./."

Thanks! We can close this issue now

On Tue, Feb 7, 2023 at 1:34 PM Yoav Voichek @.***> wrote:

Dear Lauren,

No worries. Have a swift recovery.

I just make sure that there are no "orphan" issues left behind. In this case, take your time :)

Best, Yoav

— Reply to this email directly, view it on GitHub https://github.com/voichek/kmersGWAS/issues/128#issuecomment-1421484378, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO2GFKW4X7KCHASOKJVAVNTWWK5X5ANCNFSM6AAAAAATZYXXKQ . You are receiving this because you authored the thread.Message ID: @.***>