pombase / pombase-chado

PomBase code for accessing Chado
MIT License

Training datasets for ML/AI for named entity recognition #1182

Open ValWood opened 3 months ago

ValWood commented 3 months ago

We provide a lexicon of gene names in data downloads, but we don't have a list of all alleles.

Probably a file with the columns gene, allele, allele-synonyms and description would be useful, since not all alleles match the primary name.
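As a rough sketch of how such a file could feed a named entity recognition lexicon, the Python below maps every allele name and synonym to the gene it belongs to. The column names (`gene`, `allele`, `allele_synonyms`, `description`), the `|` synonym separator and the sample rows are all made up for illustration; they are assumptions, not the layout of any actual PomBase file.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the real file's columns and separator may differ.
SAMPLE_TSV = (
    "gene\tallele\tallele_synonyms\tdescription\n"
    "abc1\tabc1-1\tabc1.1|abc1-ts\texample point mutation\n"
    "xyz2\txyz2-delta\t\texample deletion\n"
)


def build_allele_lexicon(handle, synonym_sep="|"):
    """Map each allele name or synonym to the set of genes it refers to."""
    lexicon = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        names = [row["allele"]]
        # An empty synonyms column contributes no extra names.
        synonyms = row.get("allele_synonyms") or ""
        names.extend(s for s in synonyms.split(synonym_sep) if s)
        for name in names:
            lexicon[name.strip()].add(row["gene"])
    return dict(lexicon)


lexicon = build_allele_lexicon(io.StringIO(SAMPLE_TSV))
```

Keeping a set of genes per name means a synonym shared by alleles of different genes stays visible as ambiguous instead of being silently overwritten.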

kimrutherford commented 3 months ago

We provide a lexicon of gene names in data downloads, but we don't have a list of all alleles.

We have this: https://curation.pombase.org/dumps/builds/pombase-build-2024-06-20/misc/all_alleles.tsv

Just need to add synonyms.

kimrutherford commented 3 months ago

Just need to add synonyms.

That's done for the morning.

We also have the same thing in JSON format: https://curation.pombase.org/dumps/latest_build/misc/allele_summaries.json

ValWood commented 3 months ago

OK, are these linked from our data downloads section? Maybe we should create a new section: Training datasets for AI/ML/text mining (it could include most datasets, but I could add text describing how each file could be used).
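One way a name list could be used, sketched here under stated assumptions, is dictionary-based tagging to produce weakly labelled spans for named entity recognition training. The lexicon entries, the labels `GENE` and `ALLELE`, the example sentence and the function name `tag_mentions` are hypothetical and not taken from any PomBase file or tool.

```python
import re


def tag_mentions(text, lexicon):
    """Return (start, end, surface, label) tuples for lexicon terms in text."""
    # Longest terms first, so "abc1-1" is preferred over its prefix "abc1".
    terms = sorted(lexicon, key=len, reverse=True)
    if not terms:
        return []
    # Require that a match is not glued to a word character or hyphen.
    pattern = re.compile(
        r"(?<![\w-])(?:" + "|".join(re.escape(t) for t in terms) + r")(?![\w-])"
    )
    return [
        (m.start(), m.end(), m.group(), lexicon[m.group()])
        for m in pattern.finditer(text)
    ]


# Made-up lexicon and sentence, for illustration only.
lexicon = {"abc1": "GENE", "abc1-1": "ALLELE", "xyz2-ts": "ALLELE"}
sentence = "The abc1-1 mutant was crossed with xyz2-ts; abc1 is essential."
spans = tag_mentions(sentence, lexicon)
```

Exact string matching like this misses spelling variants and cannot resolve ambiguous names, so the resulting spans are a starting point for training data, not a gold standard.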

kimrutherford commented 2 months ago

OK, are these linked from our data downloads section?

We haven't got a link. We don't have many links to individual files from the downloads page. Mostly links to directories.

Maybe we should create a new section: Training datasets for AI/ML/text mining

We should include our new directory: https://www.pombase.org/public_releases/pombase-2024-06-01/training_data_for_ML_and_AI/ once it's available.

The all_alleles.tsv file is (or will be) available as part of our new release directories: https://www.pombase.org/public_releases/pombase-2024-06-01/phenotypes_and_genotypes/ although I'd like a better name for the file.

We should probably link to the alleles file in the new release directory structure.

ValWood commented 2 months ago

For the ML data we can put a link on the website once we have consolidated the comments file a bit and included that.

Maybe we could include the alleles file in the phenotypes directory and call it phenotypes_alleles or similar?

kimrutherford commented 2 months ago

Maybe we could include the alleles file in the phenotypes directory and call it phenotypes_alleles or similar?

OK, I've renamed it to phenotype_alleles.tsv

https://www.pombase.org/public_releases/pombase-2024-06-01/phenotypes_and_genotypes/

ValWood commented 2 months ago

Sorry about that, I meant the directory name, but we already changed that to phenotypes_and_genotypes, which is better. We should keep this file name as "alleles.tsv", I think.

kimrutherford commented 2 months ago

We should keep this file name as "alleles.tsv", I think.

OK. I've made that change.