tatsuhikonaito / DEEP-HLA


Problem with impute.py #9

Open iaia87 opened 1 year ago

iaia87 commented 1 year ago

Hello,

I am currently using DEEP-HLA for the first time, and I have encountered an issue when running the impute.py script. Specifically, I am receiving the following error:

Traceback (most recent call last):
  File "impute.py", line 235, in <module>
    main()
  File "impute.py", line 231, in main
    impute(args)
  File "impute.py", line 156, in impute
    result_phased.loc[hla_info[hla][digit], sample_fam_batch.iid] = phased
  File "/home/laura/.pyenv/versions/env_deep_hla-374/lib/python3.7/site-packages/pandas/core/indexing.py", line 205, in __setitem__
    self._setitem_with_indexer(indexer, value)
  File "/home/laura/.pyenv/versions/env_deep_hla-374/lib/python3.7/site-packages/pandas/core/indexing.py", line 593, in _setitem_with_indexer
    self.obj._data = self.obj._data.setitem(indexer=indexer, value=value)
  File "/home/laura/.pyenv/versions/env_deep_hla-374/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 560, in setitem
    return self.apply("setitem", **kwargs)
  File "/home/laura/.pyenv/versions/env_deep_hla-374/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 438, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/laura/.pyenv/versions/env_deep_hla-374/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 940, in setitem
    values[indexer] = value
ValueError: shape mismatch: value array of shape (17,962) could not be broadcast to indexing result of shape (34,962)

I have tried to understand the problem, but as a novice in Python, I am struggling to find a solution. I have successfully run the train.py script, so I believe the issue is specific to impute.py.

I would greatly appreciate any help or advice you can provide.

Thank you, Laura
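
Editorial note: one way a mismatch such as (17,962) versus (34,962) can arise in a `.loc` assignment is that the selected row labels occur more than once in the index, so the selection returns twice as many rows as there are labels. Whether that is the cause here is an assumption; the thread never states it. The sketch below uses a toy table (the names `result`, `labels`, and the allele strings are hypothetical) to show how such duplicates could be detected before assigning.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a result table in which every allele label appears twice.
alleles = ["HLA_A*01:01", "HLA_A*02:01", "HLA_A*03:01"]
samples = ["s1", "s2"]
result = pd.DataFrame(0.0, index=alleles + alleles, columns=samples)

labels = alleles  # labels that would be used in the .loc selection
selected = result.loc[labels, samples]
value = np.ones((len(labels), len(samples)))

# With duplicated labels the selection has more rows than the value array.
print("selection:", selected.shape, "value:", value.shape)

# List the labels that occur more than once in the index.
duplicated = result.index[result.index.duplicated()].unique().tolist()
print("duplicated labels:", duplicated)
```

If `duplicated` is non-empty, the reference or model definition probably lists some alleles twice, and that would be worth checking before changing impute.py itself.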

iaia87 commented 1 year ago

Hello,

Just a quick update: the script works when I use the 2-digit option, so the error may be related to the HLA gene groups with more than 2 genes. However, the dosage output reports every individual as HLA-A*01, for example, which is implausible given that my population is multiethnic. Any ideas to help solve this problem? Thank you very much for your help and for the package!

Best regards, Laura

iaia87 commented 1 year ago

Hello,

I have a solution for the "ValueError: shape mismatch: value array of shape (17,962) could not be broadcast to indexing result of shape (34,962)" issue. However, there is an inconsistency in the dosage file when compared to other imputation methods. Specifically, only one HLA allele has a dosage of 2.0 while the rest have a dosage of 0.0. This result is not consistent with the results obtained from Minimac imputation. Although no errors occurred during the imputation or training process, I would appreciate any suggestions on how to resolve this issue.

Thank you for your assistance.

Best regards, Laura

tatsuhikonaito commented 1 year ago

Hi Laura,

Thank you for using Deep*HLA. I could not fully understand what you meant by "only one HLA allele has a dosage of 2.0 while the rest have a dosage of 0.0". Does this mean that all individuals were imputed as having the same single HLA allele with a dosage of 2.0, or that each individual had only one HLA allele with a dosage of 2.0 and that this allele differed between individuals? It would also be helpful if you could tell me which HLA allele of which HLA gene is affected.

Best, Tatsuhiko
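
Editorial note: the two scenarios in this question can be told apart mechanically. The sketch below assumes the dosages for one gene have already been loaded into a pandas DataFrame with alleles as rows and samples as columns; that layout, the function name `summarize_dosage`, and the toy values are all assumptions, not the documented Deep*HLA output format.

```python
import pandas as pd

def summarize_dosage(dosage: pd.DataFrame) -> dict:
    """Summarize a dosage table (rows = alleles of one gene, columns = samples)."""
    top_allele = dosage.idxmax(axis=0)        # allele with the highest dosage per sample
    homozygous = dosage.max(axis=0) == 2.0    # samples whose top allele has dosage 2.0
    return {
        "all_samples_homozygous": bool(homozygous.all()),
        "distinct_top_alleles": int(top_allele.nunique()),
        "top_allele_counts": top_allele.value_counts().to_dict(),
    }

alleles = ["HLA_A*01", "HLA_A*02", "HLA_A*03"]

# Scenario 1: every sample gets the same allele with dosage 2.0.
same = pd.DataFrame(
    {"s1": [2.0, 0.0, 0.0], "s2": [2.0, 0.0, 0.0], "s3": [2.0, 0.0, 0.0]},
    index=alleles,
)

# Scenario 2: every sample has one allele at 2.0, but the allele varies.
varied = pd.DataFrame(
    {"s1": [2.0, 0.0, 0.0], "s2": [0.0, 2.0, 0.0], "s3": [0.0, 0.0, 2.0]},
    index=alleles,
)

print(summarize_dosage(same))
print(summarize_dosage(varied))
```

A single distinct top allele across a multiethnic cohort points to a degenerate model, whereas many distinct top alleles that are all homozygous would suggest a different problem, such as how the two haplotypes are combined.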

iaia87 commented 1 year ago

Hello,

Thank you for your prompt response. I apologize if my previous explanation was not clear enough. Allow me to clarify further. The issue affects every HLA gene: each gene has exactly one allele with a dosage of 2.0 for everyone (for instance, HLA_A*01 for HLA-A), while the other alleles have a dosage of 0.0 for everyone. This happens in both 2-digit and 4-digit formats.

I believe that I have identified the problem, but I would appreciate it if you could confirm it. I utilized a 1000G multi-ethnic reference for imputation and encountered some issues while generating the HLA model using the provided script. As a result, I had to develop the code myself. I specified only one position for each gene (for example, for HLA-A: 29910247 and for HLA-B: 31321649) in the HLA model. However, the reference contains different positions for various alleles (for instance, for HLA-A, the positions range from 29910247 to 29910299). Thus, I believe that during model training, only the single position that I assigned to the HLA model was considered.

Could you please let me know if it is possible to specify an interval in the HLA model.json file? If that is the case, should I use 29910247-29910299 or 29910247:29910299?

This is just a theory, and I am open to other suggestions you may have.

Thank you and best regards, Laura


tatsuhikonaito commented 1 year ago

Hi Laura,

The current version does not support multiple positions or ranges in the HLA json files. One quick workaround is to modify the bim file of the reference data so that all HLA alleles of the same HLA gene share the same position. I was not aware of this issue and may update Deep*HLA to address it in a future version. Thank you for the information.

Best, Tatsuhiko
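
Editorial note: a minimal sketch of the bim workaround described above, assuming the standard six-column PLINK .bim layout and marker names of the form `HLA_<gene>` followed by a non-alphanumeric separator (for example `HLA_A*01`). The function name `unify_hla_positions`, the column names, and the toy rows are hypothetical; the two positions are the ones quoted earlier in this thread.

```python
import io
import re
import pandas as pd

BIM_COLUMNS = ["chrom", "marker", "cm", "pos", "a1", "a2"]

def unify_hla_positions(bim: pd.DataFrame, gene_positions: dict) -> pd.DataFrame:
    """Return a copy of `bim` in which all allele markers of a gene share one position."""
    out = bim.copy()
    for gene, pos in gene_positions.items():
        # Match "HLA_A*01" or "HLA_A_01" but not a longer gene name starting with "A".
        pattern = rf"^HLA_{re.escape(gene)}(?![A-Za-z0-9])"
        mask = out["marker"].str.contains(pattern, regex=True)
        out.loc[mask, "pos"] = pos
    return out

# Toy .bim content; a real file would be read with pd.read_csv(path, sep="\t", ...).
bim_text = (
    "6\trs0001\t0\t29900000\tA\tG\n"
    "6\tHLA_A*01\t0\t29910247\tP\tA\n"
    "6\tHLA_A*02\t0\t29910260\tP\tA\n"
    "6\tHLA_A*03\t0\t29910299\tP\tA\n"
    "6\tHLA_B*07\t0\t31321700\tP\tA\n"
)
bim = pd.read_csv(io.StringIO(bim_text), sep="\t", header=None, names=BIM_COLUMNS)

fixed = unify_hla_positions(bim, {"A": 29910247, "B": 31321649})
print(fixed.to_csv(sep="\t", header=False, index=False))
```

Only the allele markers are moved; SNP rows keep their coordinates. Tools that expect markers sorted by position may need the file re-sorted afterwards, so it is safer to write the result to a new file than to overwrite the original.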

iaia87 commented 1 year ago

Hi there,

My apologies for the delay in getting back to you. I wanted to let you know that I have changed my reference panel and retrained my model, but unfortunately, I am still encountering the same problem. Specifically, only one allele for each HLA gene has a dosage of 2, while the other alleles have a dosage of zero.

I was wondering if you had any advice on how to solve this issue? Do you think that it could be related to the training parameters?

Also, if you're interested, I would be happy to share my scripts with you.

Thank you for your time and help.

Best regards, Laura

iaia87 commented 1 year ago

Hello,

I am wondering if the problem could be related to the reference panel I am using. Specifically, I am using a multiethnic panel, and I was wondering if you think that I should try using a panel that is specific to a single population.

Additionally, I was wondering if you have any advice on how to test only a single HLA gene?

Thank you for your time and assistance.

Best regards, Laura

tatsuhikonaito commented 1 year ago

Hi Laura,

My apologies for the delayed response. I don't think that the problems were caused by the training parameters. Which reference panel did you use?

Best, Tatsuhiko

iaia87 commented 1 year ago

Hi,

I hope this email finds you well. I wanted to take a moment to express my gratitude for your valuable insights during our previous exchange. I used a multiethnic reference panel from 1000G.

After our discussion, I followed your suggestion to update the positions of the HLA allele and amino acid (AA) markers, which resolved one of the issues I was facing. By referring to your model reference, I also removed the special character "*" from my reference, which had caused an earlier problem with the "make_reference" script. That part of the package now runs smoothly.

However, I regret to inform you that despite these modifications, I am still encountering the same error during the training phase. Upon further examination, I noticed that one of the special characters (":") remains present in my dataset, and I suspect it may be contributing to the issue. As a next step, I am considering running the training without this character to determine if it alleviates the error. Additionally, I am contemplating making further adjustments, such as narrowing down the population to exclusively European subjects and changing the reference panel to the diabetes European panel, in an effort to ascertain the impact of these different parameters on the training process.

Given the complexity of the situation, I wanted to inquire whether it would be possible to share my reference panel with you for testing purposes. This would provide you with the opportunity to replicate the issue and offer more specific guidance. Alternatively, if you have a short period of time available, I would greatly appreciate scheduling a call to discuss the matter in greater detail. I believe that a conversation would enable us to delve deeper into the intricacies of the problem and explore potential solutions more effectively.

On a separate note, I would also like to take this opportunity to provide some constructive feedback on your GitHub tutorial. While I found the tutorial to be highly informative and instrumental in my work, I noticed a few areas where additional clarifications or tips could greatly benefit other users. Specifically, paying attention to certain details in the Beagle format (such as the 5 header lines) when using train.py can help newcomers (like me :) !) avoid index-related issues.

Once again, I am immensely grateful for your exceptional package and the collaborative effort we have embarked upon. Your assistance in resolving this lingering issue would be invaluable to the success of my project. Thank you for your time, support, and willingness to work together.

Best regards, Laura
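
Editorial note: the Beagle header remark above can be turned into a quick pre-flight check. The sketch below assumes that marker rows begin with the field `M`, as in the Beagle v3 text format, and that everything before the first marker row is header. The function name `count_header_lines` and the example rows are illustrative placeholders, not the exact header fields that train.py expects.

```python
def count_header_lines(lines) -> int:
    """Count the leading lines that come before the first marker ('M') row."""
    count = 0
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "M":
            break
        count += 1
    return count

# Placeholder file content: five non-marker rows followed by marker rows.
example = [
    "P pedigree fam1 fam1 fam2 fam2",
    "P father 0 0 0 0",
    "P mother 0 0 0 0",
    "I id s1 s1 s2 s2",
    "C gender 1 1 2 2",
    "M rs0001 A A G A",
    "M HLA_A_01 P A A A",
]

n_header = count_header_lines(example)
print(f"{n_header} header line(s) before the first marker row")
```

For a real file the same function can be applied to the open file handle, for example `count_header_lines(open(path))`, where `path` is the phased Beagle file. A count other than 5 would be a hint that row indices derived from the file could be shifted.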


tatsuhikonaito commented 1 year ago

Hi Laura,

It would be no problem for me to receive your reference panel as long as your institution allows it. I am also happy to discuss this on a call, although that will depend on my schedule. Could you please email me, since this platform is more suitable for general inquiries about tools?

In addition, thank you very much for suggesting updates to the Deep*HLA manual. I am currently working on that in relation to other work, and a more useful manual will also be provided in this repository.

Best regards, Tatsuhiko