repseqio / library-imgt

IMGT segment library converted to RepSeq.IO JSON format

Add Rhesus TRG #11

Closed bbimber closed 2 years ago

bbimber commented 2 years ago

@dbolotin @PoslavskySV: IMGT recently released TRG for rhesus macaque. Supporting download of these data from IMGT looks pretty simple. Does this seem like a reasonable addition? I copied the anchor points from TRBV. In human, the TRBV and TRGV anchor points were identical, so I assume that's just how IMGT formats them.

bbimber commented 2 years ago

Also, have you considered rules to parse the constant genes? The data are available:

http://www.imgt.org/genedb/GENElect?query=7.14+TRAC&species=Macaca+mulatta
http://www.imgt.org/genedb/GENElect?query=7.14+TRBC&species=Macaca+mulatta
http://www.imgt.org/genedb/GENElect?query=7.14+TRDC&species=Macaca+mulatta
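
These URLs differ only in the gene name, so they could be generated instead of hard-coded. A minimal Python sketch, assuming only the `query` and `species` parameters visible in the URLs above; the helper name `genelect_url` is made up for illustration and is not part of RepSeq.IO or IMGT:

```python
from urllib.parse import urlencode

# Base endpoint copied from the URLs listed above.
GENELECT = "http://www.imgt.org/genedb/GENElect"


def genelect_url(gene, species="Macaca mulatta", query_type="7.14"):
    """Build a GENElect URL for one gene (hypothetical helper)."""
    # urlencode encodes spaces as '+', which matches the URLs above.
    params = {"query": f"{query_type} {gene}", "species": species}
    return f"{GENELECT}?{urlencode(params)}"


# The three constant genes mentioned in this comment.
urls = [genelect_url(gene) for gene in ("TRAC", "TRBC", "TRDC")]
for url in urls:
    print(url)
```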

The one gotcha appears to be that RepSeq.IO's fromPaddedFasta does not accept FASTA files with multiple entries per gene. Since we really only care about EX1, perhaps the IMGT API supports an additional filter?

http://www.imgt.org/genedb/GENElect?query=7.14+TRAC&species=Macaca+mulatta&IMGTlabel=EX1

"IMGTlabel=EX1" works on other IMGT actions, but not this one. I can't find documentation for their APIs, or another functional query page to inspect, to discover whether we can filter on feature here. Have you tried anything like this before?
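
If the server cannot filter by label, one fallback is to filter the downloaded FASTA on the client side before it reaches fromPaddedFasta. The Python sketch below assumes the IMGT headers are pipe-delimited with the label (EX1, EX2, ...) in the fifth field; that layout should be checked against an actual download. The function name and the sample records are made up for illustration.

```python
def filter_fasta_by_label(fasta_text, label="EX1", label_field=4):
    """Keep only records whose header has `label` in the given field.

    Assumption: headers are pipe-delimited and the feature label sits at
    zero-based index `label_field` (the fifth field by default).
    """
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split("|")
            # Decide once per record; sequence lines follow the header's fate.
            keep = len(fields) > label_field and fields[label_field].strip() == label
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Made-up records (not real IMGT data): one gene with two exon entries.
sample = (
    ">ACC0001|TRAC*01|Macaca mulatta|F|EX1|\n"
    "acgtacgt\n"
    ">ACC0001|TRAC*01|Macaca mulatta|F|EX2|\n"
    "ttggccaa\n"
)
filtered = filter_fasta_by_label(sample)
print(filtered, end="")
```

This leaves one entry per gene only if each gene has a single EX1 record per allele in the download, which is also an assumption.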

bbimber commented 2 years ago

For example (not completely working):

{
  "taxonId": 9544,
  "speciesNames": [
    "rhesus_monkey",
    "macaca_mulatta"
  ],
  "rules": [
    {
      "ruleType": "import",
      "output": "output/rhesus_monkey_C_TRA",
      "geneType": "C",
      "chain": "TRA",
      "anchorPoints": [
        {
          "point": "CBegin",
          "position": 0
        },
        {
          "point": "CExon1End",
          "position": -1
        }
      ],
      "sources": [
        "http://www.imgt.org/genedb/GENElect?query=7.14+TRAC&species=Macaca+mulatta&IMGTlabel=EX1"
      ]
    }
  ]
}