open-reaction-database / ord-data

Official data repository for the Open Reaction Database
https://open-reaction-database.org
Creative Commons Attribution Share Alike 4.0 International

Add USPTO grants #58

skearnes closed 3 years ago

skearnes commented 3 years ago

Parsed from CML and sharded by year:

# Parse each year's CML files into a separate dataset shard.
for year in ~/uspto/grants/*; do
  # The directory name is the year.
  YEAR="$(basename "${year}")"
  python ord_schema/scripts/parse_uspto.py \
    --input_pattern="${HOME}/uspto/grants/${YEAR}/*.xml" \
    --name="uspto-grants-${YEAR}" \
    --output="../ord-data/grants-${YEAR}.pb" \
    --n_jobs=-1
done

skearnes commented 3 years ago

Argh, GitHub Actions is running out of memory (OOM). I might have to submit this in smaller chunks.
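One way to split the submission could look like the sketch below. It is only an illustration: `batch_files` is a hypothetical helper, not part of ord-schema, and the file names and batch size in the usage line are placeholders. It groups its arguments into batches of at most N items, one batch per line, so each batch could go into its own commit or PR. It assumes file names without spaces.

```shell
# Hypothetical helper (not part of ord-schema): print the arguments after
# the first in batches of at most $1 items, one batch per line.
# Assumes file names contain no spaces.
batch_files() {
  batch_size="$1"
  shift
  batch=""
  count=0
  for f in "$@"; do
    # Append the file to the current batch, space-separated.
    batch="${batch:+${batch} }${f}"
    count=$((count + 1))
    if [ "${count}" -eq "${batch_size}" ]; then
      printf '%s\n' "${batch}"
      batch=""
      count=0
    fi
  done
  # Emit the final, possibly short, batch.
  if [ -n "${batch}" ]; then
    printf '%s\n' "${batch}"
  fi
}

# Placeholder file names; substitute the real shards.
batch_files 2 grants-A.pb grants-B.pb grants-C.pb
```

Each output line can then be fed to `git add` and committed separately, which keeps any single CI run smaller.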

skearnes commented 3 years ago

FYI, I decided to go with gzip compression after all: it reduces the size and time of the clone by 80% and conserves our Git LFS bandwidth quota (also 50 GB/mo).
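To get a rough feel for the size savings before committing, something like the following sketch could be used. It is an assumption-laden illustration: the sample file is synthetic, highly repetitive placeholder text, so its compression ratio says nothing about real `.pb` shards. Point `sample` at an actual dataset file to measure it.

```shell
# Create a throwaway file with repetitive content. This is a stand-in for a
# real dataset; actual .pb files will compress differently.
tmpdir="$(mktemp -d)"
sample="${tmpdir}/sample.pb"
i=0
while [ "${i}" -lt 1000 ]; do
  printf 'reaction record placeholder\n'
  i=$((i + 1))
done > "${sample}"

# Compress to a separate file, leaving the original in place.
gzip -c "${sample}" > "${sample}.gz"

# Wrap wc in arithmetic expansion to strip any padding some platforms add.
orig_bytes=$(($(wc -c < "${sample}")))
gz_bytes=$(($(wc -c < "${sample}.gz")))
echo "original: ${orig_bytes} bytes, gzipped: ${gz_bytes} bytes"
echo "saved: $(( (orig_bytes - gz_bytes) * 100 / orig_bytes ))%"
```

Writing with `gzip -c` instead of compressing in place keeps the uncompressed file around for comparison.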

skearnes commented 3 years ago

Currently blocked by https://github.com/open-reaction-database/ord-schema/pull/551 and https://github.com/open-reaction-database/ord-schema/pull/552.

skearnes commented 3 years ago

(I changed the target branch, but I'm not going to re-run the tests since the last run took almost 6 hours.)