sul-dlss-labs / rialto-airflow

Airflow for harvesting data for research intelligence and open access analysis
Apache License 2.0

Add new tasks create_doi_sunet and contribs #67

Closed edsu closed 4 months ago

edsu commented 4 months ago

Once the initial DOI collection process is complete, we know the population of DOIs we are working with in the dataset. We can also map each DOI to a SUNETID using either the orcidid (for Dimensions and OpenAlex) or the cap_profile_id (for sul_pub).

The new doi_sunet task will create a mapping of doi -> [sunetid] using the pickle files, the sul_pub csv and the authors csv. The doi_set task then uses this mapping to generate the list of DOIs needed for harvesting.
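
A rough sketch of how the doi -> [sunetid] mapping could be built, not the actual implementation. The function name `build_doi_sunet`, the shape of the inputs (a dict of DOI to ORCID iDs standing in for the pickle files, lists of dicts standing in for the CSV rows) and the sample values are all assumptions made for illustration:

```python
from collections import defaultdict


def build_doi_sunet(doi_orcids, sul_pub_rows, author_rows):
    """Return {doi: sorted list of sunetids}. All input shapes are assumed."""
    # Lookup tables from the authors data: identifier -> sunetid.
    orcid_to_sunet = {
        a["orcidid"]: a["sunetid"] for a in author_rows if a.get("orcidid")
    }
    cap_to_sunet = {
        a["cap_profile_id"]: a["sunetid"]
        for a in author_rows
        if a.get("cap_profile_id")
    }

    mapping = defaultdict(set)

    # Dimensions / OpenAlex: DOIs are linked to people by orcidid.
    for doi, orcids in doi_orcids.items():
        for orcid in orcids:
            if orcid in orcid_to_sunet:
                mapping[doi].add(orcid_to_sunet[orcid])

    # sul_pub: DOIs are linked to people by cap_profile_id.
    for row in sul_pub_rows:
        sunetid = cap_to_sunet.get(row.get("cap_profile_id"))
        if row.get("doi") and sunetid:
            mapping[row["doi"]].add(sunetid)

    return {doi: sorted(sunetids) for doi, sunetids in mapping.items()}


# Hypothetical sample data.
authors = [
    {"sunetid": "jdoe", "orcidid": "0000-0001-0000-0001", "cap_profile_id": "11"},
    {"sunetid": "asmith", "orcidid": "0000-0002-0000-0002", "cap_profile_id": "22"},
]
harvested = {
    "10.1000/a": ["0000-0001-0000-0001"],
    "10.1000/b": ["0000-0002-0000-0002"],
}
sul_pub = [{"doi": "10.1000/a", "cap_profile_id": "22"}]

doi_sunet = build_doi_sunet(harvested, sul_pub, authors)
# The keys of the mapping give the set of DOIs to harvest.
doi_set = sorted(doi_sunet)
```

Using a set per DOI means an author who shows up in more than one source is only recorded once for that DOI.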

Once the publications datasets are merged, the new contribs task uses the doi_sunet mapping to add the sunetid column and split the publications into contributions, where each row has a single sunetid. Finally, the contributions are joined with the authors.csv.
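
The contribs step might look something like the following. This is only a sketch: `build_contribs`, the column names other than doi and sunetid, and the choice to keep a contribution even when no author row matches are all assumptions, not something decided in this issue:

```python
def build_contribs(pubs, doi_sunet, author_rows):
    """Split publications into one row per (publication, sunetid) pair,
    then attach the matching author columns. Input shapes are assumed."""
    authors_by_sunet = {a["sunetid"]: a for a in author_rows}

    contribs = []
    for pub in pubs:
        # One publication becomes one contribution per mapped sunetid.
        for sunetid in doi_sunet.get(pub["doi"], []):
            # Assumed left-style join: an unmatched sunetid keeps its row.
            author = authors_by_sunet.get(sunetid, {})
            contribs.append({**pub, "sunetid": sunetid, **author})
    return contribs


# Hypothetical sample data.
pubs = [
    {"doi": "10.1000/a", "title": "Paper A"},
    {"doi": "10.1000/b", "title": "Paper B"},
]
doi_sunet_map = {"10.1000/a": ["asmith", "jdoe"], "10.1000/b": ["asmith"]}
author_rows = [
    {"sunetid": "jdoe", "department": "Biology"},
    {"sunetid": "asmith", "department": "Physics"},
]

contribs = build_contribs(pubs, doi_sunet_map, author_rows)
```

Publications with no entry in the mapping produce no contribution rows in this sketch.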