open-reaction-database / ord-data

Official data repository for the Open Reaction Database
https://open-reaction-database.org
Creative Commons Attribution Share Alike 4.0 International

Problem with `de/ord_dataset-de0979205c84441190feef587fef8d6d.pb.gz` #178

Open FanwangM opened 9 months ago

FanwangM commented 9 months ago

For reaction ord-10d1b84d8d9b4252ad41370e7370c400 in this protobuf file, the reactants and the products have the same SMILES strings, OCCCCO and CCCCCCCCCCCCCCCCCCCCCCCCCCCC(=O)O. This dataset was merged into ORD in https://github.com/open-reaction-database/ord-data/pull/145.

In addition, searching for the corresponding reaction_id, ord-10d1b84d8d9b4252ad41370e7370c400, in the web search interface returns no reactions.
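A minimal sketch of how such records could be flagged, assuming the component SMILES have already been extracted into plain Python lists. The helper names `find_unchanged_products` and `is_no_op_reaction` are hypothetical and not part of ord-schema. The comparison is a plain string match, so two different spellings of the same molecule would not be caught; a real check would probably canonicalize the SMILES first (for example with RDKit).

```python
def find_unchanged_products(reactants, products):
    """Return the product SMILES that also appear verbatim among the reactants."""
    reactant_set = set(reactants)
    return [smiles for smiles in products if smiles in reactant_set]


def is_no_op_reaction(reactants, products):
    """True when there is at least one product and every product is also a reactant."""
    unchanged = find_unchanged_products(reactants, products)
    return bool(products) and len(unchanged) == len(products)


# Component SMILES as reported for ord-10d1b84d8d9b4252ad41370e7370c400.
diol = "OCCCCO"
acid = "CCCCCCCCCCCCCCCCCCCCCCCCCCCC(=O)O"
reactants = [diol, acid, "[Sn]"]
products = [diol, acid]

print(is_no_op_reaction(reactants, products))
```

Because the check only inspects strings, it is cheap enough to run over every reaction in a dataset before any chemistry-aware validation.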

FanwangM commented 9 months ago

The reaction protobuf message:

identifiers {
  type: REACTION_SMILES
  value: "[CH2:1]([CH2:2][CH2:3][CH2:4][OH:5])[OH:6].[CH3:7][CH2:8][CH2:9][CH2:10][CH2:11][CH2:12][CH2:13][CH2:14][CH2:15][CH2:16][CH2:17][CH2:18][CH2:19][CH2:20][CH2:21][CH2:22][CH2:23][CH2:24][CH2:25][CH2:26][CH2:27][CH2:28][CH2:29][CH2:30][CH2:31][CH2:32][CH2:33][C:34]([OH:35])=[O:36].[Sn:37]>>[CH2:1]([CH2:2][CH2:3][CH2:4][OH:5])[OH:6].[CH3:7][CH2:8][CH2:9][CH2:10][CH2:11][CH2:12][CH2:13][CH2:14][CH2:15][CH2:16][CH2:17][CH2:18][CH2:19][CH2:20][CH2:21][CH2:22][CH2:23][CH2:24][CH2:25][CH2:26][CH2:27][CH2:28][CH2:29][CH2:30][CH2:31][CH2:32][CH2:33][C:34](=[O:35])[OH:36]"
  is_mapped: true
}
inputs {
  key: "from_reaction_smiles"
  value {
    components {
      identifiers {
        type: SMILES
        details: "Extracted from reaction SMILES"
        value: "OCCCCO"
      }
      amount {
        unmeasured {
          type: CUSTOM
          details: "Extracted from reaction SMILES"
        }
      }
      reaction_role: REACTANT
    }
    components {
      identifiers {
        type: SMILES
        details: "Extracted from reaction SMILES"
        value: "CCCCCCCCCCCCCCCCCCCCCCCCCCCC(=O)O"
      }
      amount {
        unmeasured {
          type: CUSTOM
          details: "Extracted from reaction SMILES"
        }
      }
      reaction_role: REACTANT
    }
    components {
      identifiers {
        type: SMILES
        details: "Extracted from reaction SMILES"
        value: "[Sn]"
      }
      amount {
        unmeasured {
          type: CUSTOM
          details: "Extracted from reaction SMILES"
        }
      }
      reaction_role: REACTANT
    }
  }
}
outcomes {
  products {
    identifiers {
      type: SMILES
      details: "Extracted from reaction SMILES"
      value: "OCCCCO"
    }
    reaction_role: PRODUCT
  }
  products {
    identifiers {
      type: SMILES
      details: "Extracted from reaction SMILES"
      value: "CCCCCCCCCCCCCCCCCCCCCCCCCCCC(=O)O"
    }
    reaction_role: PRODUCT
  }
}
provenance {
  doi: "10.1039/C8SC04228D"
  record_created {
    time {
      value: "02/17/2021, 15:22:00"
    }
    person {
      name: "Steven Kearnes"
      orcid: "0000-0003-4579-4388"
      organization: "Google LLC"
      email: "kearnes@google.com"
    }
  }
  record_modified {
    time {
      value: "Wed Feb 17 23:08:03 2021"
    }
    person {
      username: "github-actions"
      email: "github-actions@github.com"
    }
    details: "Automatic updates from the submission pipeline."
  }
  record_modified {
    time {
      value: "2021-11-05 15:37:37.034278"
    }
    person {
      name: "Steven Kearnes"
      orcid: "0000-0003-4579-4388"
      organization: "Relay Therapeutics"
      email: "skearnes@relaytx.com"
    }
    details: "Add inputs/outcomes from reaction SMILES; see https://github.com/open-reaction-database/ord-schema/issues/617."
  }
}
reaction_id: "ord-10d1b84d8d9b4252ad41370e7370c400"

skearnes commented 9 months ago

This isn't showing up in the ORM; I need to check why.

skearnes commented 9 months ago

I'm also planning to set is_mined for these datasets.

bdeadman commented 3 weeks ago

When building the ORD PostgreSQL database with the ord-schema ORM, the script stalls on ord_dataset-de0979205c84441190feef587fef8d6d (I waited up to 1 h). I was able to skip this dataset using exit() and continue with ord_dataset-de10a15943a54ac7a3c3e1c774d21392.
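One way to avoid skipping a stalled dataset by hand is to load each dataset in its own child process with a time limit. The sketch below shows only that timeout pattern, using the standard library; `run_with_timeout` is a hypothetical helper, and the two commands are stand-ins for whatever per-dataset loader command is actually used, which is not shown in this thread.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_s):
    """Run command in a child process.

    Returns True if it exits with code 0 before timeout_s seconds elapse,
    and False if it fails or has to be killed because it timed out.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return False
    return completed.returncode == 0


# Stand-in commands (assumptions): one finishes at once, one simulates a stall.
quick = [sys.executable, "-c", "pass"]
stalled = [sys.executable, "-c", "import time; time.sleep(30)"]

for name, command, limit in [("quick", quick, 30), ("stalled", stalled, 1)]:
    status = "loaded" if run_with_timeout(command, limit) else "skipped"
    print(name, status)
```

Running each dataset in a separate process means a stall costs at most the chosen time limit, and the names of skipped datasets can be collected for later investigation.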