meyer-lab-cshl / plinkQC

R package for quality control of plink genetic datasets
Other
55 stars 28 forks source link

Possible Bug relatednessFilter #11

Closed GabrieleNocchi closed 5 years ago

GabrieleNocchi commented 5 years ago

The function relatednessFilter throws the following error:

Error in combn(rel, 2) : n < m

I have tried both plink --genome output files as well as custom made files, the error persists.

HannahVMeyer commented 5 years ago

Could you please provide more context - which version are you running, what it is the data you are using? Could you please provide a minimal example.

GabrieleNocchi commented 5 years ago

Hi. I am using the 0.2.2. CRAN version.

I am simply producing a genome file using plink with:

plink --vcf my.vcf --allow-extra-chr --genome

My VCF is my data, I can't share it. It includes SNPs for about 500 oak trees. It is a normal vcf.

When I run relatednessFilter with the output of IBD (plink.genome) I get that error.

As I am not a fan of plink IBD calculations, I also have a calculated relatedness with an R package, synbreed. It makes a matrix of relatedness and from the matrix I created a file with 3 columns: ID1, ID2 and Relatedness. Of course for this, as my column names are different from plink output, I specified this in my relatednessFilter command. So ID1 is relatednessIID1, relatednessIID2 is the second, and relatednessRelatedness is my Relatedness column (normally PI_HAT) and other criterion is my ID1 column. Also here I get the same error.

Does this work for you with a standard plink --genome file created from a vcf? Can you test it with just any vcf? (I have tried different vcf also with different dataset, same error).

GabrieleNocchi commented 5 years ago

Just to add, I have no issue in using the function check_relatedness with the same dataset. However I want to use relatednessFiltr because I do not want to use plink output. check_relatedness seems to need specifically plink output. IF check_relatedness can be used with custom files, by changing the column names in the command like in relatednessFilter, I'd be happy to use that.

All I want to do is a list with the minimum number for individuals to remove from my dataset to remove any relatedness above a certain threshold.

Many Thanks

HannahVMeyer commented 5 years ago

I have never tried with output generated from a vcf file - at the moment plinkQC only supports the binary plink format. However, I don't see how the IBD file would be different when generated from vcf. Could you send me the data confidentially with anonymised IDs?

You could also run check_relatedness in debug mode and tell me at which line it fails.

GabrieleNocchi commented 5 years ago

Yes I don't see how this could make a difference because relatednessFilter only takes the .genome output which does not change whether you create it from a vcf or a binary plink file, it is always the same. Let me try and get you a subset of my custom file.

GabrieleNocchi commented 5 years ago

Ok. I have attached my custom file. So these samples are already all above my threshold (already filtered with awk very simply). So basically what I am interested in is simply to get a list of suggested samples to remove (ie. the minimum number of samples to remove to get rid of relatedness, which I think that's what relatednessFilter does, from the docs https://www.rdocumentation.org/packages/plinkQC/versions/0.2.2/topics/relatednessFilter )

My command:

a <- relatednessFilter(z, otherCriterion = NULL, relatednessTh = 0.1875, otherCriterionTh = NULL, otherCriterionThDirection = c("gt", "ge", "lt", "le", "eq"), relatednessIID1 = "colIDs", relatednessIID2 = "rowIDs", relatednessFID1 = NULL, relatednessFID2 = NULL, relatednessRelatedness = "values", otherCriterionIID = "colIDs", otherCriterionMeasure = NULL)

z is simply the file I have attached, read it in with : z <- read.table("related.txt") related.txt

GabrieleNocchi commented 5 years ago

This instead is my .genome file.

As I said, I do not want to use this as plink IBD calculations are not to be given much trust. But the error persists also using this file. I changed the extension to .txt to be able to upload it here, but it is a ,genome file. nomiss.txt

GabrieleNocchi commented 5 years ago

I have never tried with output generated from a vcf file - at the moment plinkQC only supports the binary plink format. However, I don't see how the IBD file would be different when generated from vcf. Could you send me the data confidentially with anonymised IDs?

You could also run check_relatedness in debug mode and tell me at which line it fails.

IMPORTANT: Maybe you misunderstood my previous message. check_relatedness works! but it only takes the real .genome file and also wants the imiss file. I was trying to avoid to format my custom file (related.txt) to look like a plink.genome file. So I want to use relatednessFilter because can take any file you just specify the column names where the data is.

HannahVMeyer commented 5 years ago

There were two issues:

  1. I did not include a check for the case that all individuals provided would be related
  2. I did not check that the IDs in the user-provided file were characters, in your case they were factors.

Both are fixed now. Please reinstall via devtools from github.

GabrieleNocchi commented 5 years ago

Hi. It works now, the bug is solved. But still, I think the function is not behaving as expected (maybe I misunderstood it, maybe it s me wanting to do something different, please let me know). Let me just report my findings:

Use the file I passed you, related.txt

My code

aaa <- relatednessFilter(j, otherCriterion = NULL, relatednessTh = 0.1875, otherCriterionTh = NULL, otherCriterionThDirection = c("gt", "ge", "lt", "le", "eq"), relatednessIID1 = "colIDs", relatednessIID2 = "rowIDs", relatednessFID1 = NULL, relatednessFID2 = NULL, relatednessRelatedness = "values", otherCriterionIID = "colIDs", otherCriterionMeasure = NULL)

j is: j <- read.table("related.txt").

Ok from the output of aaa$failIDs it suggests to exclude 127 samples, which is simply the number of unique samples appearing in all pairwise comparison above the threshold (in my file, all of them).

Well, this is not what I want to do. I show you an example from that dataset.

Look at sample T19 and T18 for example. In my file related.txt they only appear in 1 comparison where they are above 0.1875, and that is when compare with each other. So my intendend behaviour here is to fail either T19 or T18 but NOT BOTH. instead aaa$failIDs suggest to remove both.

As you probably noticed, my related.txt file is created from a relatedness matrix which is a symmetric matrix, so each comparison shows up twice, which I thought could be the issue of why both T19 and T18 are excluded in this particular example. To better understand this basically the comparisonbetween T19 and T18 appaear twice in related.txt, first as T19 vs T18 and then as T18 vs T19.

I have tried to sort my file, related.txt, by the relatedness value and then remove each other line from it (this so that all the comparisons appear only once, like in plink IBD output) (attached file final.txt) however the following error is thrown:

Error in if (relatednessFID2 %in% names(relatedness_original)) { : argument is of length zero

I am not sure what is wrong now.

What i really want to do is to remove the least amount of samples to basically remove relatedness (defined above my threshold).

Note: in example below, when I say "related" I mean above my threshold. For example: if A is related to B and B is related to C, but between A and C the relatedness is below the threshold (so are unrelated for me) I want to remove only B. If X and Y are related with each other and nothing else, I want to remove either X or Y, but not both.

Basically, if I had to do this by hand, I would draw a sort of pedigree network and at the end remove the hubs, basically.

It is possible that I have misunderstood what relatednessFilter does, in that case, sorry.

final.txt

GabrieleNocchi commented 5 years ago

Yes I have been trying to debug this all day and I can't understand what is going on. Depending on which lines I remove from related.txt, the function thorws that error:

Error in if (relatednessFID2 %in% names(relatedness_original)) { : argument is of length zero

In my final file, final.txt, that's the error. I have also tried subsetting related.txt and I noticed this behaviour:

The error is thrown if I remove some lines, if I remove others the function works.

Also I do not understand the error because relatednessFID2 is set to NULL in the relatednessFilter command so it is normal that it has lenght zero I believe.

HannahVMeyer commented 5 years ago

You are correct, it should filter in a way that will retain the maximum number of individuals in the analysis, which is doesn't seem to be working with my quick fix this morning.

I am guessing that the function works if you keep some unrelated individuals in the text file and it might fail if all individuals in the file are related? I believe I know where the problem lies. I will be able to address this later today.

GabrieleNocchi commented 5 years ago

I am not sure what happens, but it might be the case. Just had one last try now and it seems it could be like you said, the function does not work if you accidentally remove all unrelated individuals or some, but I am not sure because if you look at my files, what was very strange is that related.txt and final.txt have the same individuals and relatedness, I simply removed the double observations like I explained (if I have A vs B and B vs A, I only keep A vs B because it has the same value and I thought that this could be the cause of the wrong filtering), but it throws the error still.

And Yes i managed to do another file, basically I removed one copy of all the repetitive observations as explained above (A vs B, B vs A) derived from my symmetric matrix but I avoided to pre-filter by relatedness before using your function, so basically I gave it a bigger file with the same format as final.txt, no repetitve observations but my entire dataset not filtered (so also with relatedness well-below 0.1875). The function works, do not throw the error in this case but the filtering still have the same issues (ie. does not remove the minimum number of samples).

I think there is something wrong in the background. I start to be very confused now, I wiill check back tomorrow!

Thank you, I will follow this thread.

GabrieleNocchi commented 5 years ago

Some information, if you use my final file final.txt for testing.

This file has the pairwise relatedness that were above 0.1875 in my entire dataset (pre-filtered). The file has 144 data lines, so 144 pairwise comparison in my entire dataset resulted in relatedness above the threshold of 0.1875. In this 144 pairwise estimates there are a total of 127 samples. I don't know the pedigree and most of the individuals could actually be related so most could be correctly be suggested for removal by your function.

However I know for sure of 3 pairs that are only related to each other and nothing else, and by looking through final.txt I can also find more. So the function should not suggest me to remove all the samples. These 3 pairs can be used to check if the function has worked and it should only remove one sample from each pair.

These are 3 pairs: T18 and T19 AG8 and FC10 BA6 and T11

I initially thought that the filtering problem was in my file related.txt . That file was created by filtering a symmetric matrix which was transformed into a 3 columns table and then filtered to get only the rows with relatedness above 0.1875. So as I explained, each observations appeared twice in related.txt, but with the samples in the opposite order. With this file the function worked but as I pointed out, it suggested to remove all 127 samples even though I know for sure some should not be removed.

By looking at the code, I thought that for each failing comparison where there are not other relation, the first or second sample is removed(the sample specified in the column of the otherCriterionIIDs I think?). So I thought maybe the function finds the same comparison twice in reverse order, so when it firs encounter it remove one sample, and when it finds the same again but in reverse it removes the other.

But this is not what happens, as with final.txt an error still pops up: Error in if (relatednessFID2 %in% names(relatedness_original)) { : argument is of length zero

I have also tried running the command by removing both the "otherCriterionThDirection" option as well as the "otherCriterionIID" options but still the same error.

relatednessFilter(j, otherCriterion = NULL, relatednessTh = 0.1875, otherCriterionTh = NULL, relatednessIID1 = "colIDs", relatednessIID2 = "rowIDs", relatednessFID1 = NULL, relatednessFID2 = NULL, relatednessRelatedness = "values", otherCriterionIID = "colIDs", otherCriterionMeasure = NULL)

relatednessFilter(j, otherCriterion = NULL, relatednessTh = 0.1875, otherCriterionTh = NULL, relatednessIID1 = "colIDs", relatednessIID2 = "rowIDs", relatednessFID1 = NULL, relatednessFID2 = NULL, relatednessRelatedness = "values", otherCriterionMeasure = NULL)

While I originally called it like:

relatednessFilter(j, otherCriterion = NULL, relatednessTh = 0.1875, otherCriterionTh = NULL, otherCriterionThDirection = c("gt", "ge", "lt", "le", "eq"), relatednessIID1 = "colIDs", relatednessIID2 = "rowIDs", relatednessFID1 = NULL, relatednessFID2 = NULL, relatednessRelatedness = "values", otherCriterionIID = "colIDs", otherCriterionMeasure = NULL)

HannahVMeyer commented 5 years ago

Hi Gabriele, please try installing the latest github version. Your problems should be fixed now. Thanks for your patience and for providing the example data and details above.

GabrieleNocchi commented 5 years ago

I will do first thing tomorrow and report back . Thanks for fixing this!

GabrieleNocchi commented 5 years ago

Hi.

So, we are getting there. The behaviour of the function is in the right direction but it does something quite confusing probably when trying to deal with a symmetric matrix derived table and it is not necessary. In final.txt the double observations are already removed.

I noticed this because I have 144 pairwise relatedness in final.txt and after using your function the output value $relatednessFails shows more comparison than those that I had at the beginning (it shows 190). In final.txt all comparisons are above the threshold so $relatednessFails should include exactly all of these, 144. By looking at the differences from final.txt and $relatednessFails, it seems that the function adds the reciprocal for some sample comparisons, while for other it does not. For example, sample AA10 is only in 4 comparisons (4 lines) in final.txt but in $relatednessFails it appears 7 times, as the function has "created" 3 reciprocal comparisons.

Look (test.txt is $relatednessFails calculated from final.txt and saved in a file with write.table). I simply grep to show differences for that sample between final.txt and the output test.txt:

gabriele: ~ 10:50:22: grep "AA10" final.txt "154" "AA10" "AH9" 0.310484021428168 "643" "AA10" "z93" 0.29911535350326 "6" "AC7" "AA10" 0.223387696206766 "5" "AC12" "AA10" 0.205118541353243 gabriele: ~ 10:50:26: grep "AA10" test.txt "70" "AA10" "AH9" 0.310484021428168 "71" "AA10" "z93" 0.29911535350326 "72" "AA10" "AC7" 0.223387696206766 "73" "AA10" "AC12" 0.205118541353243 "123" "AC12" "AA10" 0.205118541353243 "160" "z93" "AA10" 0.29911535350326 "165" "AH9" "AA10" 0.310484021428168 `

On the other side some other samples are fine, like for example AA11. If I may suggest I would not worry about the function having to deal with symmetric data, I think is fine for users like me coming from a symmetric matrix to at least prepare the file by taking one part of the matrix only (upper or lower triangle) so not to have the same relatedness twice and also remove the diagonal (which has the self to self relatedness), like I did in final.txt .

I think the function now does something trying to deal with double observations but they are not there in this case (they are in related.txt file but not in final.txt).

I noticed this because on Monday I actually did this task this semi-programatically with Perl and I know that from final.txt (127 samples) I can remove 67 samples and the relatedness above 0.1875 disappears. While the output of your function $failIDs suggest to remove 73. So we are getting there but something is still not perfect.

I saw a mistake when looking through a$failIDs. Sample EC10, is suggested for removal. In my code instead I save this sample. This sample is only related to ED1 and B81 and $failIDs suggest to remove all 3, while EC10 is not related to anybody anymore if you remove B81 and ED1. So removing all 3 would be a waste. My results with my code are not necessarily the absolute mathematical minimum number of samples to remove (maybe I could actually remove LESS) but I know that I can remove 67 or less. So if the function suggest to remove more than 67, there is some issue.

Another possible reason for this issue may be my IDs? I don't know how your code works (I don't code in R much) but maybe it gets confused because I have this sort of IDs like for example ED1 and ED10 so if you have a regex or grep system in your code looking for samples for example ED1 it also grab ED10 if you are not careful. Not sure, just telling some of the issues I had before with this dataset. Sometimes is silly things like this .

HannahVMeyer commented 5 years ago

Thanks for the feedback:

GabrieleNocchi commented 5 years ago

Thanks for explaining. Yes about your first bullet point, I think it is explained in the documentation, thanks for clarifying. I am generally interested in $failIDs and that's the first thing I looked in the output and when I saw that it suggested to remove 73 samples (6 more than expected) I just looked at relatednessFails and saw the repetitive pairwise data points and I thought that was the issue. So I pointed that out as it seemed a bit odd, but makes sense now.

Thanks yes I think you are on the right track and that should sort it. And yes I agree, with plants relatedness gets quite complex.

GabrieleNocchi commented 5 years ago

Just to add in case you use my data to test:

I double checked my relatedness filter perl script and I made a couple of mistakes. Before I said I could remove 67 samples or less from my final.txt to get rid of relatedness, but actually I left in 3 extra samples (accidentally) which should have been removed. So in final.txt your function should be able to remove 70 or less samples to get rid of relatedness not 67 or less as I said previously.

HannahVMeyer commented 5 years ago

The relationship filter now deals with more complicated relationship scenarios as observed in your example data by construction subgraphs of clusters of individuals that are related. It finds the maximum independent set of vertices in the subgraphs of related individuals. If all individuals are all related (ie all maximum independent sets are 0), one individual of that cluster will be kept and all others listed as failIDs.

For instance, in your data set:

In your test data, 66 IDs are now classified as fail. I will update the documentation in the next few days and release a new minor version of the package.

In the meantime, would you mind trying again with the version on the development branch:

devtools::install_github("HannahVMeyer/plinkQC", ref="dev")
GabrieleNocchi commented 5 years ago

Thank you. I am currently on my annual leave, I will test it once I am back in the office at the end of the month.


From: HannahVMeyer notifications@github.com Sent: Monday, August 5, 2019 10:38:09 PM To: HannahVMeyer/plinkQC plinkQC@noreply.github.com Cc: Gabriele g.nocchi@hotmail.com; State change state_change@noreply.github.com Subject: Re: [HannahVMeyer/plinkQC] Possible Bug relatednessFilter (#11)

The relationship filter now deals with more complicated relationship scenarios as observed in your example data by construction subgraphs of clusters of individuals that are related. It finds the maximum independent set of vertices in the subgraphs of related individuals. If all individuals are all related (ie all maximum independent sets are 0), one individual of that cluster will be kept and all others listed as failIDs.

For instance, in your data set:

*case 2: Y27, EC9, B71, Y6 build a sub-cluster where they are all related to another, ie all pairwise distances are 1. The algorithm will keep one individual (first in sort) and remove the others, ie it keeps B71.

In your test data, 66 IDs are now classified as fail. I will update the documentation in the next few days and release a new minor version of the package.

In the meantime, would you mind trying again with the version on the development branch:

devtools::install_github("HannahVMeyer/plinkQC", ref="dev")

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHubhttps://github.com/HannahVMeyer/plinkQC/issues/11?email_source=notifications&email_token=AEGOPD6ZCO2N3PJRKK43ZTDQDCFTDA5CNFSM4IHC43EKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3TADII#issuecomment-518390177, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AEGOPD2F7HVYMKJEQI4L5XTQDCFTDANCNFSM4IHC43EA.

GabrieleNocchi commented 5 years ago

Hi,

I had a chance to test it and everything seems to be working as expected now!

Many Thanks!

HannahVMeyer commented 5 years ago

Great, thanks for testing it! Closed and merged with master