piercelab / tcrmodel2

Apache License 2.0
Q: Source of MHC proteins #13

Closed andreas-wilm closed 9 months ago

andreas-wilm commented 9 months ago

Hi TCRmodel2 developers,

Apologies for the slightly off-topic question: may I ask what the source of the MHC I sequences (e.g. GSH..TLQ for HLA-A*23:01) on the website is?

Many thanks, Andreas

rui-yin commented 9 months ago

Hi @andreas-wilm,

We got the MHC I sequences from the IMGT database: https://www.ebi.ac.uk/ipd/imgt/hla/download/

They also have a GitHub repo for convenient downloading of their files: https://github.com/ANHIG/IMGTHLA

Regarding the specific gene you are referring to (HLA-A*23:01), you can find it here: https://raw.githubusercontent.com/ANHIG/IMGTHLA/3540/hla_prot.fasta

Best, Rui
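For anyone who wants to pull a single allele out of hla_prot.fasta, here is a minimal Python sketch. It is not part of TCRmodel2. It assumes each header has the form `>ACCESSION ALLELE LENGTH bp`, with the allele name (for example A*23:01:01:01, without the HLA- prefix) as the second whitespace-separated field. The function names `read_fasta` and `find_alleles` are made up for illustration.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_alleles(text: str, allele: str) -> Dict[str, str]:
    """Return {allele_name: sequence} for records matching `allele`.

    A record matches when its allele name equals `allele` or extends it
    with further colon-separated fields (A*23:01 matches A*23:01:01:01).
    Assumption: the allele name is the second field of the header.
    """
    hits = {}
    for header, seq in read_fasta(text):
        fields = header.split()
        if len(fields) < 2:
            continue
        name = fields[1]
        if name == allele or name.startswith(allele + ":"):
            hits[name] = seq
    return hits
```

Typical use would be `find_alleles(open("hla_prot.fasta").read(), "A*23:01")`. Note that the returned sequences are the full-length proteins as distributed in that file, not the trimmed sequences shown on the website.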

andreas-wilm commented 9 months ago

Thanks @rui-yin!

You seem to have preprocessed the full sequences in that database to just extract the alpha 1 and 2 domains. Would you be able to share the exact process for reproducibility purposes? Did you use Pfam PF00129 / InterPro IPR011161 to achieve this?

Many thanks, Andreas

rui-yin commented 9 months ago

No problem, Andreas, happy to help! We extracted the alpha 1 and 2 domains of Class I MHC using a hidden Markov model built from a multiple sequence alignment containing alpha 1 and 2 domains of Class I MHC sequences. You can refer to the trim_mhc function in seq_utils.py to see how the processing is performed.

Best, Rui
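To illustrate the general idea of HMM-based trimming, here is a rough Python sketch. It is not the trim_mhc implementation from seq_utils.py, which should be consulted for the exact procedure. The sketch assumes you have already searched your sequences with a profile HMM using HMMER's `hmmsearch --domtblout`, and that the output follows the standard per-domain table layout (domain score in column 14, alignment start and end on the target in columns 18 and 19, counting from 1). The function names `best_domain_span` and `trim_to_span` are hypothetical.

```python
from typing import Optional, Tuple


def best_domain_span(domtblout: str, target: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based (start, end) of the best-scoring domain hit
    for `target` in hmmsearch --domtblout text, or None if no hit."""
    best = None
    for line in domtblout.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split(None, 22)  # the description may contain spaces
        if len(cols) < 22 or cols[0] != target:
            continue
        score = float(cols[13])                  # per-domain bit score
        start, end = int(cols[17]), int(cols[18])  # ali from / ali to
        if best is None or score > best[0]:
            best = (score, start, end)
    return None if best is None else (best[1], best[2])


def trim_to_span(seq: str, span: Tuple[int, int]) -> str:
    """Cut `seq` to an inclusive, 1-based (start, end) span."""
    start, end = span
    return seq[start - 1:end]
```

Whether to trim on alignment coordinates or on envelope coordinates (columns 20 and 21) is a design choice. The sketch uses alignment coordinates purely as an assumption, so its output may differ from the website sequences by a few residues at either end.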

andreas-wilm commented 9 months ago

Oh, it's all in ./scripts! Wonderful. Thank you very much!