Closed by krassowski 4 years ago
Also, the journal name could be included in the model, as we have a higher chance of a new method being published in e.g. "Bioinformatics" than in others.
Also, some DOIs may carry a "software" label (or is that Zenodo-specific?)
Just trying to understand: will we be able to include Dryad, OSF (if there are any in those), CRAN, tools hosted on the authors' own lab or academic institution website (some people do that!), Python repositories, and GitHub, and will we ignore those for/from MATLAB etc.?
Not sure if I understood, but I will elaborate on what is in scope: I am starting from PubMed, so only things which are published; they may also be on GitHub/CRAN/PyPI etc., but this will not include everything that is there (only the published ones).
I can also include preprints from bioRxiv, but I would refrain from specifically screening GitHub/CRAN etc. for the purpose of this work, as there might be too much noise there. Or we could do a very simple search just to get an idea of the numbers, without delving deep into the details (again, I am afraid of too much noise, which would require manual curation).
"Not sure if I understood": you understood it perfectly! I meant the same thing: the papers on hand (the more than 3000 of them, etc.) should lead us to GitHub/CRAN/PyPI/Dryad/Zenodo/OSF etc., and NOT the other way round, from repositories to papers! :) So you are right, we need to avoid noise and manual curation!
Great! I created a new issue to track this as a sub-task (and then we can use the results to support classification into method/application as in here).
We can use the already labelled methods to generate a list of topics (i.e. words with probabilities) for articles which:
This can be done with LDA on titles + abstracts, and sometimes on the full texts too. Importantly, this is not a hard clustering exercise, as a publication can both introduce a new method and apply it to a new dataset.
We need to have a training set. We could use tags from publishers and also manually curate a small set of method and application papers, for example drawing from https://github.com/mikelove/awesome-multi-omics