krassowski / multi-omics-state-of-the-field

Analyses for "State of the field in multi-omics research: from computational needs to data mining and sharing"
https://doi.org/10.3389/fgene.2020.610798
MIT License

Classify into topics "Application of a method", "Introduction of a method" #7

Closed krassowski closed 4 years ago

krassowski commented 4 years ago

We can use the already labelled methods to generate a list of topics (i.e. words with probabilities) for articles which apply a method and for articles which introduce one (the two topics named in the title).

This can be done with LDA on titles + abstracts, and possibly on the full texts too. Importantly, this is not a hard clustering exercise, as a publication can both introduce a new method and apply it to a new dataset.

We need a training set. We could use tags from publishers and also manually curate a small set of method and application papers, for example drawing from the list at https://github.com/mikelove/awesome-multi-omics
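To make the "not a hard clustering" point concrete, here is a minimal sketch that scores each label independently, so one paper can receive both. Note that it swaps LDA for something much simpler and dependency-free: smoothed per-label word probabilities learned from a tiny labelled set. The training sentences, label names and function names are all made up for illustration; a real run would use LDA or a proper classifier on the curated titles + abstracts.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-made training examples (not real papers).
# A text may carry one label or both.
TRAINING = [
    ("a novel algorithm for multi-omics integration", {"introduction"}),
    ("we propose a new method and software package", {"introduction"}),
    ("we applied existing tools to a patient cohort", {"application"}),
    ("analysis of tumour samples using a published pipeline", {"application"}),
    ("we propose a new method and applied it to a patient cohort",
     {"introduction", "application"}),
]
LABELS = ("introduction", "application")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words separately in texts with and without each label."""
    counts = {label: {True: Counter(), False: Counter()} for label in LABELS}
    vocab = set()
    for text, labels in examples:
        tokens = tokenize(text)
        vocab.update(tokens)
        for label in LABELS:
            counts[label][label in labels].update(tokens)
    return counts, vocab


def score(text, counts, vocab):
    """Return an independent score in (0, 1) for each label.

    Scores are computed per label from smoothed word log-odds, so they
    do not have to sum to one: a paper can score high on both labels.
    """
    result = {}
    for label in LABELS:
        with_label, without_label = counts[label][True], counts[label][False]
        with_total = sum(with_label.values()) + len(vocab)
        without_total = sum(without_label.values()) + len(vocab)
        log_odds = 0.0
        for token in tokenize(text):
            if token not in vocab:
                continue  # unseen words carry no evidence in this sketch
            p_with = (with_label[token] + 1) / with_total
            p_without = (without_label[token] + 1) / without_total
            log_odds += math.log(p_with / p_without)
        result[label] = 1 / (1 + math.exp(-log_odds))
    return result


COUNTS, VOCAB = train(TRAINING)
METHOD_ONLY = score("a novel algorithm for data integration", COUNTS, VOCAB)
BOTH = score("we propose a new method and applied it to a patient cohort",
             COUNTS, VOCAB)
```

With this toy data, `METHOD_ONLY` scores high for "introduction" only, while `BOTH` scores above 0.5 for both labels, which is the behaviour a soft topic assignment should have.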

krassowski commented 4 years ago

Also, the journal name could be included in the model, as a new method is more likely to be published in e.g. "Bioinformatics" than in other journals.
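One simple way to feed the journal into the model is a per-journal rate of method papers, estimated from the labelled set and shrunk towards the overall rate so that journals with few papers do not get extreme values. The sketch below assumes exactly that; the records, the second journal name and the `pseudo_count` parameter are hypothetical, not taken from the actual data.

```python
from collections import defaultdict

# Hypothetical labelled records: (journal, paper_introduces_a_method).
LABELLED = [
    ("Bioinformatics", True),
    ("Bioinformatics", True),
    ("Bioinformatics", True),
    ("Bioinformatics", False),
    ("Some Clinical Journal", True),
    ("Some Clinical Journal", False),
    ("Some Clinical Journal", False),
    ("Some Clinical Journal", False),
]


def journal_method_rates(records, pseudo_count=1.0):
    """Estimate P(method paper | journal), shrunk towards the overall rate.

    Returns (rates_by_journal, overall_rate). The overall rate can be
    used as a fallback for journals absent from the labelled set.
    """
    totals = defaultdict(int)
    methods = defaultdict(int)
    for journal, is_method in records:
        totals[journal] += 1
        methods[journal] += int(is_method)
    overall = sum(methods.values()) / len(records)
    rates = {
        journal: (methods[journal] + pseudo_count * overall)
        / (totals[journal] + pseudo_count)
        for journal in totals
    }
    return rates, overall


RATES, OVERALL = journal_method_rates(LABELLED)
# An unseen journal falls back to the overall rate.
UNSEEN = RATES.get("Journal Not In Training Set", OVERALL)
```

The resulting rate could be used as an extra feature or as a prior that is combined with the text-based score.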

krassowski commented 4 years ago

Also, some DOIs may carry a "software" label (or is it Zenodo-specific?)
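On the DOI question: Zenodo registers its DOIs with DataCite, and the DataCite metadata schema has a `resourceTypeGeneral` field with `Software` among its controlled values, so the label is probably a DataCite feature rather than something Zenodo-specific. Journal articles usually have Crossref DOIs, which use a different metadata format. The sketch below only inspects an already-downloaded record and assumes the JSON layout of the DataCite REST API (`data.attributes.types.resourceTypeGeneral`); the sample records and the DOI `10.5281/zenodo.0000000` are hand-written placeholders, not real entries.

```python
def resource_type(record):
    """Return resourceTypeGeneral from a DataCite-style record, or None.

    Assumes the layout data -> attributes -> types -> resourceTypeGeneral.
    """
    attributes = (record.get("data") or {}).get("attributes") or {}
    types = attributes.get("types") or {}
    return types.get("resourceTypeGeneral")


def is_software(record):
    """True if the record is labelled as software (case-insensitive)."""
    value = resource_type(record)
    return isinstance(value, str) and value.lower() == "software"


# Hand-written placeholder records that mimic the assumed layout.
SAMPLE_SOFTWARE = {
    "data": {
        "id": "10.5281/zenodo.0000000",
        "type": "dois",
        "attributes": {
            "doi": "10.5281/zenodo.0000000",
            "types": {"resourceTypeGeneral": "Software"},
        },
    }
}
SAMPLE_DATASET = {
    "data": {
        "attributes": {"types": {"resourceTypeGeneral": "Dataset"}},
    }
}
```

Records without the field (for example, metadata from another registration agency) simply yield `False`, so the label can only ever be a supporting signal.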

biswapriyamisra commented 4 years ago

"Also, some DOIs may carry 'software' label (or is it ZENODO specific?)"

biswapriyamisra commented 4 years ago

Just trying to understand: will we be able to include Dryad, OSF (if there are any in those), CRAN, tools hosted on the authors' own lab or academic institution websites (some people do that!), Python repositories and GitHub, and will we ignore those for/from MATLAB etc.?

krassowski commented 4 years ago

Not sure if I understood, but I will elaborate on what is in scope: I am starting from PubMed, so only things which are published; they may also be on GitHub/CRAN/PyPI etc., but this will not cover everything that is there (only the published ones).

I can also include preprints from bioRxiv, but I would refrain from specifically screening GitHub/CRAN etc. for the purpose of this work, as there might be too much noise there. Or we could do a very simple search just to get an idea of the numbers, but not delve deep into the details (again, I am afraid of too much noise, which would require manual curation).
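Going from published papers to repositories (rather than screening the repositories themselves) could be as simple as scanning abstracts or full texts for known repository links. The sketch below assumes a handful of URL patterns for GitHub, CRAN, PyPI, Zenodo DOIs and OSF; the patterns are illustrative and incomplete, and the sample abstract with its `example-lab/example-tool` links is invented.

```python
import re

# Illustrative patterns only; real links come in more variants than these.
PATTERNS = {
    "github": re.compile(r"github\.com/([\w.-]+/[\w.-]+)", re.IGNORECASE),
    "cran": re.compile(
        r"cran\.r-project\.org/(?:package=|web/packages/)([\w.]+)",
        re.IGNORECASE,
    ),
    "pypi": re.compile(r"pypi\.org/project/([\w.-]+)", re.IGNORECASE),
    "zenodo": re.compile(r"10\.5281/zenodo\.(\d+)", re.IGNORECASE),
    # Assumes five-character OSF identifiers.
    "osf": re.compile(r"osf\.io/([a-z0-9]{5})\b", re.IGNORECASE),
}


def find_repository_links(text):
    """Map each host to the sorted, de-duplicated identifiers found in text."""
    found = {}
    for host, pattern in PATTERNS.items():
        # Strip a sentence-final period that the pattern may have swallowed.
        identifiers = sorted({m.rstrip(".") for m in pattern.findall(text)})
        if identifiers:
            found[host] = identifiers
    return found


# Invented abstract fragment used only to exercise the patterns.
SAMPLE_ABSTRACT = (
    "The package is available at https://github.com/example-lab/example-tool "
    "and on CRAN (https://cran.r-project.org/package=exampletool). "
    "Data are archived at https://doi.org/10.5281/zenodo.1234567."
)
LINKS = find_repository_links(SAMPLE_ABSTRACT)
```

Counting the hosts across all papers would give the rough numbers mentioned above without any manual curation of the repositories themselves.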

biswapriyamisra commented 4 years ago

"Not sure if I understood," You understood it perfectly! I meant the same: the papers on hand (more than 3000 of them) should lead us to GitHub/CRAN/PyPI/Dryad/Zenodo/OSF etc., and NOT the other way round, from repositories to papers! :) So you are right: we need to avoid noise and manual curation.

krassowski commented 4 years ago

Great! I created a new issue to track this as a sub-task (and then we can use the results to support classification into method/application as in here).