nextstrain / ingest

Shared internal tooling for pathogen data ingest. Used by our pathogen build repos.

Add duplicated scripts from pathogen repos #1

Closed: jameshadfield closed 11 months ago

jameshadfield commented 1 year ago

The first step in making this repository useful is to populate it with scripts that are currently manually copied around pathogen repos.

See shared GDoc for additional context and details on scripts.

Progress

This was originally created by @joverlee521 in https://github.com/nextstrain/ingest/issues/1#issuecomment-1636328472.

Identical scripts (added in #6)

Diverged scripts, with different versions in use across workflows (binned into related groups):

Simple notify scripts (added in #8)

S3 interaction + notify scripts that depend on S3 files (added in #12)

Genbank interactions

Nextclade joining

Potential augur curate scripts

Summary of differences

This is the original issue text from @jameshadfield.

Here's a quick scan of duplicated ingest scripts, using monkeypox as the "base", against 4 other ingest script directories:

Directories of scripts considered:

mpx       # monkeypox/ingest/bin at a1f0d7b
hbv       # hepatitisB/ingest/scripts at 1cdd197
rsv       # rsv/ingest/bin at ba171f4
dengue    # dengue/ingest/bin on branch new_ingest at 247b2fd
ncov      # ncov-ingest/bin at 88fddbe

Note that when only 1-3 lines differ, the difference is often just an added comment indicating where the script was copied from.

mpx/apply-geolocation-rules
        rsv/apply-geolocation-rules     IDENTICAL
        hbv/apply-geolocation-rules.py      17 lines different
        dengue/apply-geolocation-rules  IDENTICAL
mpx/cloudfront-invalidate
        rsv/cloudfront-invalidate       IDENTICAL
        dengue/cloudfront-invalidate    IDENTICAL
        ncov/cloudfront-invalidate      IDENTICAL
mpx/csv-to-ndjson
        rsv/csv-to-ndjson.py      16 lines different
        dengue/csv-to-ndjson    IDENTICAL
        ncov/csv-to-ndjson       3 lines different
mpx/download-from-s3
        dengue/download-from-s3       2 lines different
        ncov/download-from-s3       8 lines different
mpx/fasta-to-ndjson
        rsv/fasta-to-ndjson     IDENTICAL
        dengue/fasta-to-ndjson  IDENTICAL
mpx/fetch-from-genbank
        dengue/fetch-from-genbank       1 lines different
mpx/genbank-url
        rsv/genbank-url      42 lines different
        dengue/genbank-url      11 lines different
mpx/join-metadata-and-clades.py
        rsv/join-metadata-and-clades.py       3 lines different
        dengue/join-metadata-and-clades.py      IDENTICAL
        ncov/join-metadata-and-clades     114 lines different
mpx/merge-user-metadata
        rsv/merge-user-metadata IDENTICAL
        dengue/merge-user-metadata      IDENTICAL
mpx/ndjson-to-tsv-and-fasta
        rsv/ndjson-to-tsv-and-fasta     IDENTICAL
        dengue/ndjson-to-tsv-and-fasta  IDENTICAL
mpx/notify-on-diff
        dengue/notify-on-diff   IDENTICAL
mpx/notify-on-job-fail
        rsv/notify-on-job-fail       1 lines different
        dengue/notify-on-job-fail       1 lines different
        ncov/notify-on-job-fail      10 lines different
mpx/notify-on-job-start
        rsv/notify-on-job-start       3 lines different
        dengue/notify-on-job-start       3 lines different
        ncov/notify-on-job-start      30 lines different
mpx/notify-on-record-change
        rsv/notify-on-record-change       3 lines different
        dengue/notify-on-record-change       3 lines different
        ncov/notify-on-record-change       6 lines different
mpx/notify-slack
        rsv/notify-slack      15 lines different
        dengue/notify-slack     IDENTICAL
        ncov/notify-slack      16 lines different
mpx/reverse_reversed_sequences.py
        dengue/reverse_reversed_sequences.py    IDENTICAL
mpx/s3-object-exists
        rsv/s3-object-exists    IDENTICAL
        dengue/s3-object-exists IDENTICAL
        ncov/s3-object-exists       1 lines different
mpx/sha256sum
        rsv/sha256sum   IDENTICAL
        dengue/sha256sum        IDENTICAL
        ncov/sha256sum       1 lines different
mpx/transform-authors
        rsv/transform-authors   IDENTICAL
        dengue/transform-authors        IDENTICAL
mpx/transform-date-fields
        rsv/transform-date-fields       IDENTICAL
        dengue/transform-date-fields    IDENTICAL
mpx/transform-field-names
        rsv/transform-field-names       IDENTICAL
        dengue/transform-field-names    IDENTICAL
mpx/transform-genbank-location
        rsv/transform-genbank-location  IDENTICAL
        dengue/transform-genbank-location       IDENTICAL
mpx/transform-strain-names
        rsv/transform-strain-names       1 lines different
        dengue/transform-strain-names   IDENTICAL
mpx/transform-string-fields
        rsv/transform-string-fields     IDENTICAL
        dengue/transform-string-fields  IDENTICAL
mpx/trigger
        dengue/trigger  IDENTICAL
        ncov/trigger    IDENTICAL
mpx/trigger-on-new-data
        dengue/trigger-on-new-data       1 lines different
        ncov/trigger-on-new-data       6 lines different
mpx/upload-to-s3
        rsv/upload-to-s3       3 lines different
        dengue/upload-to-s3       3 lines different
        ncov/upload-to-s3       1 lines different
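
For anyone who wants to rerun a scan like the one above, here is a minimal sketch of how it could be done. The issue does not say how the original numbers were produced, so this is an assumption rather than the method actually used: scripts are matched by file name with any trailing `.py` ignored (as in the listing), and "lines different" counts lines added or removed in a `difflib` line diff, so a modified line counts twice. The names `script_key`, `count_changed_lines` and `compare_dirs` are hypothetical.

```python
import difflib
from pathlib import Path


def script_key(path: Path) -> str:
    """Match scripts by name, ignoring a trailing .py extension."""
    return path.name[:-3] if path.name.endswith(".py") else path.name


def count_changed_lines(base_text: str, other_text: str) -> int:
    """Count lines added or removed in a line-based diff.

    Assumption: a modified line counts as two (one removal, one addition).
    """
    diff = difflib.ndiff(base_text.splitlines(), other_text.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def compare_dirs(base_label: str, base_dir: Path, other_dirs: dict[str, Path]) -> list[str]:
    """Report, for each base script, how same-named scripts elsewhere differ."""
    report = []
    for base in sorted(p for p in base_dir.iterdir() if p.is_file()):
        matches = []
        for label, directory in other_dirs.items():
            for other in sorted(directory.iterdir()):
                if other.is_file() and script_key(other) == script_key(base):
                    n = count_changed_lines(base.read_text(), other.read_text())
                    status = "IDENTICAL" if n == 0 else f"{n} lines different"
                    matches.append(f"        {label}/{other.name}\t{status}")
        # Only list base scripts that have at least one counterpart.
        if matches:
            report.append(f"{base_label}/{base.name}")
            report.extend(matches)
    return report
```

Calling `compare_dirs("mpx", Path("monkeypox/ingest/bin"), {"rsv": Path("rsv/ingest/bin")})` on local checkouts and printing the returned lines would give a listing in the same shape as the one above; the checkout paths here are placeholders.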
joverlee521 commented 1 year ago

See shared GDoc for additional context and details on scripts.


Checklist of scripts that need to be added for me to keep track of progress:

Identical scripts (added in #6)

Diverged scripts, with different versions in use across workflows (binned into related groups):

Simple notify scripts (added in #8)

S3 interaction + notify scripts that depend on S3 files (Jover's WIP branch)

Genbank interactions

Nextclade joining

Potential augur curate scripts

victorlin commented 1 year ago

@joverlee521 thanks for making the checklist in the comment above! It'll be useful to have it continually updated. To make that easier, I've moved it to the main issue text.

joverlee521 commented 1 year ago

In talking through #20 with @j23414, we realized that join-metadata-and-clades can mostly be replaced with a couple of csvtk commands (csvtk cut | csvtk rename2 | csvtk join).

The version of the script in ncov-ingest adds clock_deviation, but that can be done separately from the joining.
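
To make the cut / rename / join idea concrete, here is a rough Python sketch of the same three steps using only the standard library. It is not the csvtk pipeline itself and not the existing script. The function name, the `accession` ID field, and the choice to keep every metadata row (a left join) are assumptions for illustration; `seqName` and `clade` are the Nextclade TSV column names assumed as defaults.

```python
import csv
import io


def join_metadata_and_clades(metadata_tsv, nextclade_tsv, id_field="accession",
                             nextclade_id="seqName", column_map=None):
    """Append selected, renamed Nextclade columns to each metadata row."""
    # Maps Nextclade column name -> output column name (hypothetical default).
    column_map = column_map or {"clade": "clade"}

    # "cut" and "rename": keep only the mapped columns, keyed by sequence ID.
    clades = {}
    for row in csv.DictReader(io.StringIO(nextclade_tsv), delimiter="\t"):
        clades[row[nextclade_id]] = {new: row[old] for old, new in column_map.items()}

    # "join": keep every metadata row; unmatched rows get empty values.
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    fieldnames = list(reader.fieldnames) + list(column_map.values())
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        extra = clades.get(row[id_field], {})
        added = {name: extra.get(name, "") for name in column_map.values()}
        writer.writerow({**row, **added})
    return out.getvalue()


if __name__ == "__main__":
    metadata = "accession\tdate\nA1\t2020\nB2\t2021\n"
    nextclade = "seqName\tclade\tqc\nA1\tIIb\tgood\n"
    print(join_metadata_and_clades(metadata, nextclade), end="")
```

Anything pathogen-specific, such as the clock_deviation column mentioned above, would stay out of this step and be computed separately.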

joverlee521 commented 11 months ago

Closing issue as we have resolved all of the listed duplicate scripts. Any other additions can be opened as separate issues in the future.