Providing sample ids with bgen files

jean997 commented 3 years ago

Hi, I am running Predict.py with some bgen files. I actually have two issues. The first is that I am unable to pass sample IDs. I tried supplying the sample file with the --text_sample_ids option but I still see the message "Sample IDs are not present in this file. I will generate them on my own". Is there another way I should supply them?

The second issue is that the run time is very long: many hours so far. I am wondering what the normal running time for predicting from the GTEx v8 models should be.

Thanks! Jean

hakyim commented 3 years ago

How large is your data?

Yanyu (cc’d) has performed prediction in the UK biobank and may be able to help.

Haky

jean997 commented 3 years ago

Oh perfect. It is UK Biobank data. We have imputed bgen files which I think are about 1.3 million SNPs by about 500k individuals.

liangyy commented 3 years ago

Hey Jean,

Honestly, I have not used BGEN files with Predict.py before, nor --text_sample_ids. To address your question, I did a small test run on a BGEN file that does contain sample IDs, and --text_sample_ids successfully replaced them.

Maybe you could check whether your sample file is in the right format (see the reference here: https://github.com/hakyimlab/MetaXcan/wiki/Individual-level-PrediXcan:-introduction,-tutorials-and-manual#text-dosage-format)?

Regardless, the warning message you saw comes from the backend, bgen-reader (link: https://github.com/limix/bgen-reader-py/blob/1712a358fc5a9948868eead6b6d70e287e21cf35/bgen_reader/_samples.py#L15), and you can assume, as usual, that the sample ordering is consistent with the genotypes.

Regarding the running time, I think you're right: the current Predict.py script is quite slow at handling BGEN files. In my experience, rbgen is faster than bgen-reader (the backend of Predict.py).

My UKB job was done using another script, written by @miltondp, which uses rbgen as the backend. For reference, it took about 24 hrs to predict one tissue for the UKB cohort. If you're interested in trying this option, I can share the script with you. For a test run on 50 samples, Predict.py takes 30 min and that script takes <1 min.

Thanks!

Yanyu

jean997 commented 3 years ago

Thanks Yanyu! I would love to check out that script. It sounds like it would be very helpful for us. Jean

liangyy commented 3 years ago

Hi Jean,

I just cleaned up the script we used. You will find the code at https://github.com/hakyimlab/predixcan_prediction.

Let me know if you encounter any issues using the script above. You can either create an issue there or send me an email.

Thanks!

Yanyu

hakyim commented 3 years ago

Let me clarify about access to data. We need to make sure that we are compliant with the UK Biobank data use agreement.

On Fri, Nov 13, 2020 at 2:45 PM Yanyu Liang notifications@github.com wrote:

Besides, I have predicted expression on several tissues (GTEx V8 CTIMP models). If you need a quick kickstart, I can also share these results with you.

jean997 commented 3 years ago

Thanks for the script Yanyu. We tried running it but got this error:

  File "/predict.py", line 139, in save
    self.D_samples = self.D_file.create_dataset("samples", (self.n_samples,), dtype='S25')

any ideas?

liangyy commented 3 years ago

hmm .. Could you show the full error message?

jean997 commented 3 years ago

Ah sorry. Here is the whole thing

Traceback (most recent call last):
  File "/net/snowwhite/home/jvmorr/software/predixcan_prediction/predict.py", line 242, in <module>
    transcription_matrix.save()
  File "/net/snowwhite/home/jvmorr/software/predixcan_prediction/predict.py", line 139, in save
    self.D_samples = self.D_file.create_dataset("samples", (self.n_samples,), dtype='S25')
AttributeError: 'TranscriptionMatrix' object has no attribute 'D_file'

liangyy commented 3 years ago

Hi Jean, I suspect that there is no overlap between the SNPs in the predictdb and those in your BGEN file. Currently, predict.py only supports predictdb files that use rsIDs as SNP IDs. If you are using the predictdb files from predictdb.org, may I ask whether you're using the elastic net models? If not, maybe give them a try first?
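A quick way to test this suspicion is to count how many of the model's SNP IDs appear among the variant IDs in the genotype file. The following is only a sketch: it assumes the predictdb file is a SQLite database with a `weights` table containing an `rsid` column, and the function names `model_rsids` and `overlap_report` are made up for illustration. The genotype variant IDs would come from whatever BGEN reader you already use.

```python
import sqlite3
from contextlib import closing


def model_rsids(db_path):
    """Return the set of SNP IDs referenced by a predictdb-style SQLite file.

    Assumption: the file has a `weights` table with an `rsid` column.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute("SELECT DISTINCT rsid FROM weights").fetchall()
    return {row[0] for row in rows}


def overlap_report(db_path, genotype_ids):
    """Count how many model SNPs are present among the genotype variant IDs."""
    wanted = model_rsids(db_path)
    found = wanted & set(genotype_ids)
    return {"model_snps": len(wanted), "found": len(found)}
```

If `found` is zero, no gene can be predicted, which would be consistent with the script failing before it creates its output file.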

jean997 commented 3 years ago

Ok we switched to using the elastic net models and are now getting this error

ERROR: There are not enough rows in your sample file! Make sure dosage files and sample files have the same number of individuals in the same order.

However the sample files and the bgen files do have the same number. I checked by loading a few SNPs using rbgen in R.

liangyy commented 3 years ago

Hi Jean, just to clarify, which sample file are you using? Is it the one shipped with your bgen files?

jean997 commented 3 years ago

Hi Yanyu -- I just discovered by reading the code that it skips the first two lines of the sample file (lines 131 and 132). I added two empty lines to the top and I got the script to run successfully. One question I have, I tested by just trying to predict 5 genes. I only got results for one gene. Do you know why that would be? Maybe not all the genes were in the .db file? I think we got this little list from the mash files so maybe there is not complete overlap.

jean997 commented 3 years ago

As a follow up, do you know what we would need to do to run with the mashr models?

liangyy commented 3 years ago

Hi Yanyu -- I just discovered by reading the code that it skips the first two lines of the sample file (lines 131 and 132). I added two empty lines to the top and I got the script to run successfully.

Nice catch! Yes, the script assumes the sample file follows the bgen convention, so the first two rows are skipped.
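For anyone else starting from a plain list of IDs, here is a small sketch of that convention: an Oxford-style .sample file begins with a line of column names and a line of column types, followed by one line per individual. The helper names are hypothetical, and reading the ID from the first column is an assumption on my part rather than a statement about what predict.py does internally.

```python
def read_sample_ids(path, n_header=2):
    """Read sample IDs from an Oxford-style .sample file.

    The first `n_header` lines (column names and column types) are skipped.
    Assumption: the ID of interest is the first whitespace-separated field.
    """
    with open(path) as fh:
        rows = [line.split() for line in fh]
    return [fields[0] for fields in rows[n_header:] if fields]


def write_sample_file(ids, path):
    """Write a minimal Oxford-style .sample file from a plain list of IDs."""
    with open(path, "w") as fh:
        fh.write("ID_1 ID_2 missing\n")  # header line 1: column names
        fh.write("0 0 0\n")              # header line 2: column types
        for sample_id in ids:
            fh.write(f"{sample_id} {sample_id} 0\n")
```

Comparing `len(read_sample_ids(path))` with the number of individuals in the BGEN file is a quick check before a long run. Writing real header lines instead of two empty lines keeps the file usable by other tools that expect this format.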

One question I have, I tested by just trying to predict 5 genes. I only got results for one gene. Do you know why that would be? Maybe not all the genes were in the .db file? I think we got this little list from the mash files so maybe there is not complete overlap.

Yes, that is likely the case: we probably don't have models for those genes among the EN models.

As a follow up, do you know what we would need to do to run with the mashr models?

The difficulty with using the mashr models is that the prediction models were built using WGS data from GTEx, and the UKB genotype data doesn't have these SNPs labeled with rsIDs. I think there are a couple of workaround options. I will discuss with @hakyim and see if we could release an updated predictdb with SNPs annotated with UKB genotype SNP IDs so that everything is a bit easier.

hakyim commented 3 years ago

We checked the overlap between the GTEx and the UK Biobank SNP sets and it was pretty high, over 95% I believe. So there are two options: either ignore the missing SNPs or impute them. Depending on your goals, I would suggest just ignoring them.
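To illustrate what ignoring the missing SNPs amounts to: predicted expression for a gene is a weighted sum of dosages, so a SNP absent from the genotype data simply drops out of the sum. The sketch below is not the code of either script; the function name and the dict-based inputs are chosen only to keep the example self-contained.

```python
def predict_expression(weights, dosages):
    """Weighted sum of dosages over the SNPs available in the genotype data.

    weights: dict mapping SNP ID -> model weight for one gene
    dosages: dict mapping SNP ID -> list of per-sample dosages
    Returns (predictions, n_snps_used); model SNPs without dosages are skipped.
    """
    used = [snp for snp in weights if snp in dosages]
    if not used:
        return [], 0
    n_samples = len(dosages[used[0]])
    predictions = [0.0] * n_samples
    for snp in used:
        weight = weights[snp]
        for i, dosage in enumerate(dosages[snp]):
            predictions[i] += weight * dosage
    return predictions, len(used)
```

Reporting the number of SNPs actually used alongside the prediction makes it easy to flag genes whose models lost most of their SNPs.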

liangyy commented 3 years ago

We checked the overlap between the GTEx and the UK Biobank SNP sets and they were pretty high, over 95% I believe. So there are two options. Either to ignore the missing SNPs or to impute them. Depending on your goals, I would suggest just to ignore them.

Is the overlap based on rsID or genomic position? I can see that some of the UKB SNPs do not have rsID even though the same SNPs have rsID in mashr models.
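The difference between the two matching strategies can be sketched as follows. This assumes both variant lists are on the same genome build with alleles in the same orientation, which would need to be ensured separately (for example by a liftover step); the dict layout and function names are made up for the example.

```python
def position_key(chrom, pos, ref, alt):
    """Build a build-specific key such as ('1', 12345, 'A', 'G')."""
    chrom = str(chrom)
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref.upper(), alt.upper())


def compare_overlap(model_variants, genotype_variants):
    """Count model variants matched by rsID and by position key.

    Each variant is a dict with keys: rsid, chrom, pos, ref, alt.
    An rsid of None, '' or '.' is treated as missing.
    """
    missing = (None, "", ".")
    geno_rsids = {v["rsid"] for v in genotype_variants if v["rsid"] not in missing}
    geno_keys = {
        position_key(v["chrom"], v["pos"], v["ref"], v["alt"])
        for v in genotype_variants
    }
    by_rsid = sum(1 for v in model_variants if v["rsid"] in geno_rsids)
    by_position = sum(
        1
        for v in model_variants
        if position_key(v["chrom"], v["pos"], v["ref"], v["alt"]) in geno_keys
    )
    return by_rsid, by_position
```

A large gap between the two counts would point to SNPs that are present in the genotype data but lack an rsID label there.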

hakyim commented 3 years ago

I forgot the details of the comparison, but it was done for the calculation of the phenomexcan associations. UK Biobank GWAS results were not imputed, just lifted over. The phenomexcan GitHub repo should have more details on what Milton did. We just used a mapping of UKB SNPs to GTEx SNPs.

hakyimlab / MetaXcan

Providing sample ids with bgen files #108