Closed by agitter 7 years ago
@agitter I'm on vacation until May 16. It's unlikely I'll be able to get to this until then. But here are the options:
Option 3 will give best results. If you want to add more metadata for URLs, I'd begin option 3 in my absence.
Thanks @dhimmel. I like the idea of 2 but expect it will slow us down. We may go with 1 so that we don't delay the submission and can explore 3 if time permits.
I'll leave the issue open to remind me to work on 3 if I can.
Cross-referencing #387, a book with a URL citation.
Greycite is working at this time. This API call for @url:http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints
returns:
{
"URL":"http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints",
"title":"Convolutional Networks on Graphs for Learning Molecular Fingerprints",
"issued":{
"date-parts":[
[
2015
]
]
},
"author":[
{
"family":"Duvenaud",
"given":"David K."
},
{
"family":"Maclaurin",
"given":"Dougal"
},
{
"family":"Iparraguirre",
"given":"Jorge"
},
{
"family":"Bombarell",
"given":"Rafael"
},
{
"family":"Hirzel",
"given":"Timothy"
},
{
"family":"Aspuru-Guzik",
"given":"Alan"
},
{
"family":"Adams",
"given":"Ryan P."
}
],
"greycite-status":"Scanned",
"greycite-scanned":"2017-05-09 00:03:57"
}
I'm hoping this is valid CSL JSON. If so, we should be able to use Greycite to help fill in URL references.
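A rough way to sanity-check a response like the one above is sketched below. It only looks at the structure visible in this thread (title, author names, nested date-parts) and drops the greycite-* bookkeeping keys, which are not CSL variables. The function names are made up for illustration, and this is not a substitute for validating against the official CSL JSON schema.

```python
GREYCITE_PREFIX = "greycite-"  # prefix of the Greycite-specific keys seen above


def clean_greycite_record(record):
    """Return a copy of the record without the greycite-* bookkeeping keys."""
    return {k: v for k, v in record.items() if not k.startswith(GREYCITE_PREFIX)}


def csl_problems(record):
    """List structural problems that would likely trip up a CSL processor."""
    problems = []
    if not isinstance(record.get("title"), str):
        problems.append("title missing or not a string")
    for author in record.get("author", []):
        named = isinstance(author, dict) and ("family" in author or "literal" in author)
        if not named:
            problems.append(f"author lacks a family or literal name: {author!r}")
    issued = record.get("issued")
    if issued is not None:
        date_parts = issued.get("date-parts") if isinstance(issued, dict) else None
        nested = isinstance(date_parts, list) and date_parts and isinstance(date_parts[0], list)
        if not nested:
            problems.append("issued.date-parts is not a nested list")
    return problems


# Trimmed copy of the Greycite response shown above.
response = {
    "title": "Convolutional Networks on Graphs for Learning Molecular Fingerprints",
    "issued": {"date-parts": [[2015]]},
    "author": [{"family": "Duvenaud", "given": "David K."}],
    "greycite-status": "Scanned",
    "greycite-scanned": "2017-05-09 00:03:57",
}
cleaned = clean_greycite_record(response)
print(sorted(cleaned), csl_problems(cleaned))
```

Note that the response above carries no id or type field, so whatever consumes it would presumably need to supply those.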
I'm going to look into adding Greycite support now.
Awesome. It looks like it does reasonable things for all URLs, e.g. this wiki page, in which case we wouldn't have to separate conference paper URLs from other URLs.
Greycite support was implemented in https://github.com/greenelab/deep-review/issues/464. In some cases, we may still want to manually override the Greycite metadata. Let's defer that decision.
@dhimmel Does the arxiv2bib package provide a way for us to include the arXiv id or some other identifier in the reference list? Or can we retain that from the original reference? These entries feel incomplete without something that denotes them as arXiv preprints.
> These entries feel incomplete without something that denotes them as arXiv preprints.
Totally agree. There's no unique ID as part of these references!
> Does the arxiv2bib package provide a way for us to include the arXiv id or some other identifier in the reference list?
In https://github.com/greenelab/deep-review/pull/465, I fixed the arxiv2bib bibtex to include a url attribute. So now, bibliography.json includes the URL for arXiv records like:
{
"URL": "http://arxiv.org/abs/1403.1347v1",
"abstract": "Predicting protein secondary structure is a fundamental problem in protein structure prediction. Here we present a new supervised generative stochastic network (GSN) based method to predict local secondary structure with deep hierarchical representations. GSN is a recently proposed deep learning technique (Bengio & Thibodeau-Laufer, 2013) to globally train deep generative model. We present the supervised extension of GSN, which learns a Markov chain to sample from a conditional distribution, and applied it to protein structure prediction. To scale the model to full-sized, high-dimensional data, like protein sequences with hundreds of amino acids, we introduce a convolutional architecture, which allows efficient learning across multiple layers of hierarchical representations. Our architecture uniquely focuses on predicting structured low-level labels informed with both low and high-level representations learned by the model. In our application this corresponds to labeling the secondary structure state of each amino-acid residue. We trained and tested the model on separate sets of non-homologous proteins sharing less than 30dataset, better than the previously reported best performance 64.9al., 2011) for this challenging secondary structure prediction problem.",
"author": [
{
"family": "Zhou",
"given": "Jian"
},
{
"family": "Troyanskaya",
"given": "Olga G."
}
],
"id": "8t43CQ9m",
"issued": {
"date-parts": [
[
2014,
3
]
]
},
"title": "Deep supervised and convolutional generative stochastic network for protein secondary structure prediction",
"type": "article-journal"
},
So the URL is there; it's just not getting added to the references. This behavior is up to our CSL. We're using the CSL for Journal of the Royal Society Interface.
So we could create a modified CSL? For example, I think every title should be a hyperlink to the work. This would solve any ambiguity issues.
Obviously, an upstream change would be preferred, but that may not be feasible. Will look into it.
Hmm, I'll think about a modified CSL. We may want to consult with @cgreene once he's available again.
A hack I've used with other CSLs is to provide arXiv:1403.1347 (for example) as the journal name. It's not technically correct, but it gets the arXiv id to show up.
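The journal-name hack could be applied to the CSL JSON records with something like the sketch below. It assumes the arXiv id can be recovered from a URL field of the form shown earlier (http://arxiv.org/abs/1403.1347v1) and that the CSL style prints container-title as the journal name; the function name is hypothetical and this is not what the build currently does.

```python
import re

# Matches new-style arXiv ids in abs URLs, with an optional version suffix.
ARXIV_URL = re.compile(r"arxiv\.org/abs/(?P<id>\d{4}\.\d{4,5})(?:v\d+)?$")


def add_arxiv_container_title(record):
    """Return a copy of the record with 'arXiv:<id>' as container-title.

    Records without an arXiv URL, or that already name a container,
    are returned unchanged (as a copy).
    """
    updated = dict(record)
    match = ARXIV_URL.search(record.get("URL", ""))
    if match is not None and not record.get("container-title"):
        updated["container-title"] = f"arXiv:{match.group('id')}"
    return updated


record = {
    "URL": "http://arxiv.org/abs/1403.1347v1",
    "title": "Deep supervised and convolutional generative stochastic network "
             "for protein secondary structure prediction",
    "type": "article-journal",
}
print(add_arxiv_container_title(record)["container-title"])
```

Dropping the version suffix matches the arXiv:1403.1347 form used in the example above; keeping it would be an equally defensible choice.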
@dhimmel you mentioned above:
> In some cases, we may still want to manually override the Greycite metadata.
What is your suggestion for manual overrides if we have a few misformatted references?
> What is your suggestion for manual overrides if we have a few misformatted references?
What we'd do is create a file with CSL JSON with the correct details. Here's an example of the JSON for a single reference:
{
"URL": "http://arxiv.org/abs/1512.03542v1",
"abstract": "Exponential growth in Electronic Healthcare Records (EHR) has resulted in new opportunities and urgent needs for discovery of meaningful data-driven representations and patterns of diseases in Computational Phenotyping research. Deep Learning models have shown superior performance for robust prediction in computational phenotyping tasks, but suffer from the issue of model interpretability which is crucial for clinicians involved in decision-making. In this paper, we introduce a novel knowledge-distillation approach called Interpretable Mimic Learning, to learn interpretable phenotype features for making robust prediction while mimicking the performance of deep learning models. Our framework uses Gradient Boosting Trees to learn interpretable features from deep learning models such as Stacked Denoising Autoencoder and Long Short-Term Memory. Exhaustive experiments on a real-world clinical time-series dataset show that our method obtains similar or better performance than the deep learning models, and it provides interpretable phenotypes for clinical decision making.",
"author": [
{
"family": "Che",
"given": "Zhengping"
},
{
"family": "Purushotham",
"given": "Sanjay"
},
{
"family": "Khemani",
"given": "Robinder"
},
{
"family": "Liu",
"given": "Yan"
}
],
"id": "14DAmZTDg",
"issued": {
"date-parts": [
[
2015,
12
]
]
},
"title": "Distilling knowledge from deep networks with applications to healthcare domain",
"type": "article-journal"
}
Of course, abstract can be omitted and other fields can be added.
It looks like my reference manager does a good job of exporting to CSL JSON so this shouldn't be too hard.
[
{
"type": "article-journal",
"title": "Deep learning as an opportunity in virtual screening",
"container-title": "Advances in neural information processing systems",
"volume": "27",
"source": "Google Scholar",
"URL": "http://www.bioinf.at/publications/2014/NIPS2014a.pdf",
"author": [
{
"family": "Unterthiner",
"given": "Thomas"
},
{
"family": "Mayr",
"given": "Andreas"
},
{
"family": "Klambauer",
"given": "Günter"
},
{
"family": "Steijaert",
"given": "Marvin"
},
{
"family": "Wegner",
"given": "Jörg K."
},
{
"family": "Ceulemans",
"given": "Hugo"
},
{
"family": "Hochreiter",
"given": "Sepp"
}
],
"issued": {
"date-parts": [
[
"2014"
]
]
},
"accessed": {
"date-parts": [
[
"2017",
5,
19
]
]
}
}
]
@dhimmel How would I include this JSON file in the build process? My export gives some id attribute specific to me.
I'm also curious about what we would like to do with the arXiv ids. Would that require a modified CSL or would sticking them in a different field be acceptable?
@dhimmel these are the references that could use manual updates that I noticed during my proofreading. Some could be ignored (maybe all with DOIs), but some are missing important information:
If I create JSON files, should I use the canonical citation (e.g. URL) as the id attribute so that you could map it to the citation_id?
Regarding arXiv, would bibtex_passthrough in citations.py be the best place to add the arXiv id to a bibtex field that will be used by the CSL?
@agitter we will need to manually override the CSL. For records that you provide JSON for, we can either:
What do you think makes sense?
@agitter you will produce a CSL JSON file of the correct metadata. Then you will have to add a citation_id field to each record. These should be in the format of the standard_citation column of processed-citations.tsv. The build process will pop this citation_id field and use it to fill the id field with the hash.
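The pop-and-hash step described above might look roughly like the sketch below. Several things here are assumptions: the hash function is a stand-in (the real one lives in the build code and is not shown in this thread), the citation_id format in the example is invented, and a manual record is taken to replace the generated one wholesale.

```python
import hashlib


def hash_citation_id(citation_id):
    """Hypothetical stand-in for the build's hash; not the deep-review implementation."""
    return hashlib.sha1(citation_id.encode("utf-8")).hexdigest()[:9]


def apply_manual_overrides(bibliography, manual_records):
    """Replace generated CSL records with manual ones that share the same id."""
    overrides = {}
    for record in manual_records:
        record = dict(record)  # copy so the caller's data is not mutated
        citation_id = record.pop("citation_id")
        record["id"] = hash_citation_id(citation_id)
        overrides[record["id"]] = record
    # Swap in overrides where ids match, then append any that matched nothing.
    merged = [overrides.pop(item["id"], item) for item in bibliography]
    merged.extend(overrides.values())
    return merged


# Example with a made-up citation_id; real values come from processed-citations.tsv.
generated = [{"id": hash_citation_id("url:https://example.org/paper"), "title": "Wrong title"}]
manual = [{"citation_id": "url:https://example.org/paper", "title": "Correct title"}]
print(apply_manual_overrides(generated, manual))
```

Appending unmatched manual records instead of raising an error is a design choice for the sketch; a stricter build might prefer to fail so that a mistyped citation_id is noticed.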
In press occurs, I believe, when there is no date. Add a date and in press should go away.
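If missing dates are indeed the cause, a quick scan of the CSL JSON could list the affected records. This is a minimal sketch that assumes dates are stored under issued as date-parts, as in the records shown in this thread; the function name is made up.

```python
def records_missing_date(bibliography):
    """Return ids (or titles) of CSL records that have no usable issued date."""
    missing = []
    for record in bibliography:
        date_parts = (record.get("issued") or {}).get("date-parts") or []
        has_year = bool(date_parts) and bool(date_parts[0])
        if not has_year:
            missing.append(record.get("id", record.get("title", "<unknown>")))
    return missing


bibliography = [
    {"id": "8t43CQ9m", "issued": {"date-parts": [[2014, 3]]}},
    {"id": "no-date-example", "title": "A record without an issued field"},
]
print(records_missing_date(bibliography))
```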
> Regarding arXiv, would bibtex_passthrough in citations.py be the best place to add the arXiv id to a bibtex field that will be used by the CSL?
This should be done automatically. Let me deal with it.
@dhimmel I think completely replacing the CSL data with the new record makes sense. I can manually decide whether to copy the existing CSL record and update the broken parts or ignore it and export a record from Zotero. If you agree, I'll get started on a new JSON file. Should I make a pull request for a manual-citations.json to the references branch?
I can match the standard_citation of processed-citations.tsv.
Will I be able to see the HTML or PDF outputs that are built by Travis before we merge? Or will I have to build locally to see if my changes work as expected?
@dhimmel added URLs to all references and we manually edited all major errors in the reference list.
We've been using URL references for two purposes: 1) blog posts, online news articles, etc. 2) conference papers
In both cases, do we need to add some additional information to the references? Case 2 is the greater concern. I'm wondering whether we need to start working on a local BibTeX file to resolve these.