microbiomedata / mixs-6-2-release-candidate

Proposed, Harmonized MIxS 6.2
https://github.com/GenomicsStandardsConsortium/mixs6.2_release_candidate
MIT License

DTM text mining will require ongoing curation #68

Open turbomam opened 1 year ago

turbomam commented 1 year ago

Especially config/curated_slot_notes_by_text_mining.tsv, which is the first prerequisite of this Makefile rule:

text_mining_results/mixs_v6_repaired_term_title_token_matrix.tsv: config/curated_slot_notes_by_text_mining.tsv \
generated_schema/GSC_MIxS_6.yaml schemasheets_to_usage/GSC_MIxS_6_concise_usage.tsv
    $(RUN) add_notes_from_text_mining \
        --dtm-input-slot title \
        --input-col-vals-file text_mining_results/mixs_v6_repaired_term_title_token_list.tsv \
        --input-dtm-notes-mapping $(word 1,$^) \
        --input-schema-file $(word 2,$^) \
        --input-usage-report $(word 3,$^) \
        --output-schema-file generated_schema/GSC_MIxS_6.yaml.notated.yaml \
        --dtm-output $@

turbomam commented 1 year ago

There are probably better ways to extract topics, such as more advanced tokenization, vectorization, and thesaurus lookup.
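
As a rough illustration of what that could look like, here is a minimal standard-library sketch, not the current add_notes_from_text_mining implementation. It tokenizes slot titles with a regex, normalizes tokens through a small thesaurus, and weights the resulting document-term matrix with TF-IDF. The stop-word list, the thesaurus entries, the function names, and the sample titles are all hypothetical and are not taken from the repository.

```python
import math
import re
from collections import Counter

# Hypothetical stop words and thesaurus; neither comes from the MIxS repo.
STOP_WORDS = {"of", "the", "and", "in", "to"}
THESAURUS = {"temp": "temperature", "samp": "sample"}


def tokenize(title):
    """Lowercase, split on non-alphanumerics, drop stop words, map synonyms."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [THESAURUS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


def build_dtm(titles):
    """Build a TF-IDF weighted document-term matrix from {slot: title}.

    Returns (vocab, rows): vocab is the sorted token list and rows[slot]
    is a list of weights aligned with vocab.
    """
    counts = {slot: Counter(tokenize(title)) for slot, title in titles.items()}
    # Document frequency: in how many titles each token appears.
    doc_freq = Counter(tok for c in counts.values() for tok in c)
    vocab = sorted(doc_freq)
    n_docs = len(counts)
    rows = {}
    for slot, c in counts.items():
        # The "+ 1.0" keeps tokens that occur in every title from vanishing.
        rows[slot] = [
            c[tok] * (math.log(n_docs / doc_freq[tok]) + 1.0) for tok in vocab
        ]
    return vocab, rows


# Illustrative titles only; not read from the generated schema.
example_titles = {
    "temp": "temperature",
    "samp_store_temp": "sample storage temperature",
    "soil_temp": "soil temp",
}
vocab, rows = build_dtm(example_titles)
print(vocab)
for slot, weights in rows.items():
    print(slot, [round(w, 3) for w in weights])
```

The thesaurus step is where the ongoing curation would move: instead of curating notes per raw token, one would curate the synonym map, and abbreviations such as "temp" would collapse into a single column of the matrix.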