URLs in references - GitHub issues

agitter commented 7 years ago

We've been using URL references for two purposes: 1) Blog posts, online news articles, etc. 2) Conference papers

In both cases, do we need to add some additional information to the references? Case 2 is the greater concern. I'm wondering whether we need to start working on a local BibTeX file to resolve these.

dhimmel commented 7 years ago

@agitter I'm on vacation until May 16. It's unlikely I'll be able to get to this until then. But here are the options:

1. Leave as is. The journal can fill in the URL references at a later time.
2. Look into automatic methods for URL citation metadata extraction. There is something called Greycite that may be able to do this. However, its uptime and reliability are questionable.
3. Manually add metadata (either as CSL JSON or BibTeX) for the URLs. This may be good if the articles are in Google Scholar (although those BibTeX records often have errors).

Option 3 will give best results. If you want to add more metadata for URLs, I'd begin option 3 in my absence.

agitter commented 7 years ago

Thanks @dhimmel. I like the idea of 2 but expect it will slow us down. We may go with 1 so that we don't delay the submission and can explore 3 if time permits.

I'll leave the issue open to remind me to work on 3 if I can.

agitter commented 7 years ago

Cross-referencing #387, a book with a URL citation.

dhimmel commented 7 years ago

Greycite is working at this time. This API call for @url:http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints returns:

{
   "URL":"http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints",
   "title":"Convolutional Networks on Graphs for Learning Molecular Fingerprints",
   "issued":{
      "date-parts":[
         [
            2015
         ]
      ]
   },
   "author":[
      {
         "family":"Duvenaud",
         "given":"David K."
      },
      {
         "family":"Maclaurin",
         "given":"Dougal"
      },
      {
         "family":"Iparraguirre",
         "given":"Jorge"
      },
      {
         "family":"Bombarell",
         "given":"Rafael"
      },
      {
         "family":"Hirzel",
         "given":"Timothy"
      },
      {
         "family":"Aspuru-Guzik",
         "given":"Alan"
      },
      {
         "family":"Adams",
         "given":"Ryan P."
      }
   ],
   "greycite-status":"Scanned",
   "greycite-scanned":"2017-05-09 00:03:57"
}

I'm hoping this is valid CSL JSON. If so, we should be able to use Greycite to help fill in URL references.
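As a rough illustration of how a response like the one above could be turned into a CSL JSON item, here is a minimal Python sketch. The function name `to_csl_item`, the placeholder id, and the default type of "webpage" are assumptions made for this example, not part of Greycite or the deep-review build. The author list is abbreviated to one entry to keep the sample short.

```python
import json


def to_csl_item(greycite_record, item_id):
    """Return a copy of the record without Greycite bookkeeping keys.

    Hypothetical helper: the name and behavior are illustrative only.
    """
    # Drop keys such as "greycite-status", which are not CSL variables.
    item = {
        key: value
        for key, value in greycite_record.items()
        if not key.startswith("greycite-")
    }
    # CSL processors generally expect an "id" and a "type" on every item.
    # The default type here is an assumption for this sketch.
    item["id"] = item_id
    item.setdefault("type", "webpage")
    return item


# Abbreviated copy of the response shown above.
response_text = """
{
  "URL": "http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints",
  "title": "Convolutional Networks on Graphs for Learning Molecular Fingerprints",
  "issued": {"date-parts": [[2015]]},
  "author": [{"family": "Duvenaud", "given": "David K."}],
  "greycite-status": "Scanned",
  "greycite-scanned": "2017-05-09 00:03:57"
}
"""

record = json.loads(response_text)
csl_item = to_csl_item(record, item_id="example-id")
print(json.dumps(csl_item, indent=2))
```

The original record is left untouched, so the raw Greycite response can still be cached or inspected separately.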

dhimmel commented 7 years ago

I'm going to look into adding Greycite support now.

agitter commented 7 years ago

Awesome. It looks like it does reasonable things for all URLs, e.g. this wiki page, in which case we wouldn't have to separate conference paper URLs from other URLs.

dhimmel commented 7 years ago

Greycite support was implemented in https://github.com/greenelab/deep-review/issues/464. In some cases, we may still want to manually override the Greycite metadata. Let's defer that decision.

agitter commented 7 years ago

@dhimmel Does the arxiv2bib package provide a way for us to include the arXiv id or some other identifier in the reference list? Or can we retain that from the original reference? These entries feel incomplete without something that denotes them as arXiv preprints.

dhimmel commented 7 years ago

These entries feel incomplete without something that denotes them as arXiv preprints.

Totally agree. There's no unique ID as part of these references!

Does the arxiv2bib package provide a way for us to include the arXiv id or some other identifier in the reference list?

In https://github.com/greenelab/deep-review/pull/465, I fixed the arxiv2bib bibtex to include a url attribute. So now, bibliography.json includes the URL for arXiv records like:

  {
    "URL": "http://arxiv.org/abs/1403.1347v1",
    "abstract": "Predicting protein secondary structure is a fundamental problem in protein structure prediction. Here we present a new supervised generative stochastic network (GSN) based method to predict local secondary structure with deep hierarchical representations. GSN is a recently proposed deep learning technique (Bengio & Thibodeau-Laufer, 2013) to globally train deep generative model. We present the supervised extension of GSN, which learns a Markov chain to sample from a conditional distribution, and applied it to protein structure prediction. To scale the model to full-sized, high-dimensional data, like protein sequences with hundreds of amino acids, we introduce a convolutional architecture, which allows efficient learning across multiple layers of hierarchical representations. Our architecture uniquely focuses on predicting structured low-level labels informed with both low and high-level representations learned by the model. In our application this corresponds to labeling the secondary structure state of each amino-acid residue. We trained and tested the model on separate sets of non-homologous proteins sharing less than 30dataset, better than the previously reported best performance 64.9al., 2011) for this challenging secondary structure prediction problem.",
    "author": [
      {
        "family": "Zhou",
        "given": "Jian"
      },
      {
        "family": "Troyanskaya",
        "given": "Olga G."
      }
    ],
    "id": "8t43CQ9m",
    "issued": {
      "date-parts": [
        [
          2014,
          3
        ]
      ]
    },
    "title": "Deep supervised and convolutional generative stochastic network for protein secondary structure prediction",
    "type": "article-journal"
  },

So the URL is there; it's just not getting added to the references. This behavior is up to our CSL. We're using the CSL for Journal of the Royal Society Interface.

So we could create a modified CSL? For example, I think every title should be a hyperlink to the work. This would solve any ambiguity issues.

Obviously, an upstream change would be preferred, but that may not be feasible. I'll look into it.

agitter commented 7 years ago

Hmm, I'll think about a modified CSL. We may want to consult with @cgreene once he's available again.

A hack I've used with other CSLs is to provide arXiv:1403.1347 (for example) as the journal name. It's not technically correct, but it gets the arXiv id to show up.
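The journal-name hack could be sketched roughly as follows in Python. This is only an illustration: the function `add_arxiv_container` and the regular expression are assumptions for this example, and the pattern only handles new-style arXiv identifiers such as 1403.1347. In CSL JSON, the journal name is stored in the `container-title` field.

```python
import re

# Matches new-style arXiv abstract URLs, with an optional version suffix.
ARXIV_URL = re.compile(r"arxiv\.org/abs/(?P<id>\d{4}\.\d{4,5})(?:v\d+)?$")


def add_arxiv_container(item):
    """Return a copy of a CSL item with 'arXiv:<id>' as its container-title.

    Hypothetical helper: items whose URL is not an arXiv abstract link
    are returned unchanged (as a copy).
    """
    updated = dict(item)
    match = ARXIV_URL.search(item.get("URL", ""))
    if match is not None:
        updated["container-title"] = "arXiv:" + match.group("id")
    return updated


record = {
    "URL": "http://arxiv.org/abs/1403.1347v1",
    "title": "Deep supervised and convolutional generative stochastic network "
             "for protein secondary structure prediction",
    "type": "article-journal",
}

patched = add_arxiv_container(record)
print(patched["container-title"])  # prints "arXiv:1403.1347"
```

The version suffix (v1) is dropped so the displayed identifier matches the form used in the hack above.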

agitter commented 7 years ago

@dhimmel you mentioned above

In some cases, we may still want to manually override the Greycite metadata.

What is your suggestion for manual overrides if we have a few misformatted references?

dhimmel commented 7 years ago

What is your suggestion for manual overrides if we have a few misformatted references?

What we'd do is create a file with CSL JSON with the correct details. Here's an example of the JSON for a single reference:

  {
    "URL": "http://arxiv.org/abs/1512.03542v1",
    "abstract": "Exponential growth in Electronic Healthcare Records (EHR) has resulted in new opportunities and urgent needs for discovery of meaningful data-driven representations and patterns of diseases in Computational Phenotyping research. Deep Learning models have shown superior performance for robust prediction in computational phenotyping tasks, but suffer from the issue of model interpretability which is crucial for clinicians involved in decision-making. In this paper, we introduce a novel knowledge-distillation approach called Interpretable Mimic Learning, to learn interpretable phenotype features for making robust prediction while mimicking the performance of deep learning models. Our framework uses Gradient Boosting Trees to learn interpretable features from deep learning models such as Stacked Denoising Autoencoder and Long Short-Term Memory. Exhaustive experiments on a real-world clinical time-series dataset show that our method obtains similar or better performance than the deep learning models, and it provides interpretable phenotypes for clinical decision making.",
    "author": [
      {
        "family": "Che",
        "given": "Zhengping"
      },
      {
        "family": "Purushotham",
        "given": "Sanjay"
      },
      {
        "family": "Khemani",
        "given": "Robinder"
      },
      {
        "family": "Liu",
        "given": "Yan"
      }
    ],
    "id": "14DAmZTDg",
    "issued": {
      "date-parts": [
        [
          2015,
          12
        ]
      ]
    },
    "title": "Distilling knowledge from deep networks with applications to healthcare domain",
    "type": "article-journal"
  }

Of course, abstract can be omitted and other fields can be added.

agitter commented 7 years ago

It looks like my reference manager does a good job of exporting to CSL JSON so this shouldn't be too hard.

[
 {
  "type": "article-journal",
  "title": "Deep learning as an opportunity in virtual screening",
  "container-title": "Advances in neural information processing systems",
  "volume": "27",
  "source": "Google Scholar",
  "URL": "http://www.bioinf.at/publications/2014/NIPS2014a.pdf",
  "author": [
   {
    "family": "Unterthiner",
    "given": "Thomas"
   },
   {
    "family": "Mayr",
    "given": "Andreas"
   },
   {
    "family": "Klambauer",
    "given": "Günter"
   },
   {
    "family": "Steijaert",
    "given": "Marvin"
   },
   {
    "family": "Wegner",
    "given": "Jörg K."
   },
   {
    "family": "Ceulemans",
    "given": "Hugo"
   },
   {
    "family": "Hochreiter",
    "given": "Sepp"
   }
  ],
  "issued": {
   "date-parts": [
    [
     "2014"
    ]
   ]
  },
  "accessed": {
   "date-parts": [
    [
     "2017",
     5,
     19
    ]
   ]
  }
 }
]

@dhimmel How would I include this JSON file into the build process? My export gives some id attribute specific to me.

I'm also curious about what we would like to do with the arXiv ids. Would that require a modified CSL or would sticking them in a different field be acceptable?

agitter commented 7 years ago

@dhimmel these are the references that could use manual updates that I noticed during my proofreading. Some could be ignored (maybe all with DOIs), but some are missing important information:

Wrong date: 2004 Academy of Management: Andrew S. Grove. See http://www.intel.com/pressroom/archive/speeches/ag080998.htm.
Wrong author and date: Dunnavant F. 2014 Print. See https://www.nigms.nih.gov/Education/Documents/curiosity.pdf.
Remove in press: In press. RR Results - CASP12. See http://www.predictioncenter.org/casp12/rrc_avrg_results.cgi.
Remove in press: In press. CAMEO - Continuous Automated Model Evaluation - Welcome. See http://www.cameo3d.org/.
Error: In press. RUL - Napaka / Error. See https://repozitorij.uni-lj.si/IzpisGradiva.php?id=85515.
Title and authors: 2016 OUP accepted manuscript. Briefings In Bioinformatics (doi:10.1093/bib/bbw110)
Special characters: VidoviÄ‡ D, Koleti A, SchÃ¼rer SC. 2014 Large-scale integration of small molecule-induced genome-wide transcriptional responses, Kinome-wide binding affinities and cell-growth inhibition profiles reveal global trends characterizing systems-level drug action. Frontiers in Genetics 5. (doi:10.3389/fgene.2014.00342)
No data: 2015 See http://www.bioinf.at/publications/2014/NIPS2014a.pdf.
Remove in press: deepchem. In press. deepchem/deepchem. GitHub. See https://github.com/deepchem/deepchem.
No data: 2017 See https://www.robots.ox.ac.uk/~vedaldi/assets/pubs/mahendran16salient.pdf.
Wrong authors: @googleresearch. 2015 Inceptionism: Going Deeper into Neural Networks. Research Blog. See http://googleresearch.blogspot.co.uk/2015/06/inceptionism-going-deeper-into-neural.html.
In press: In press. Visualizing Higher-Layer Features of a Deep Network - LISA - Publications - Aigaion 2.0. See http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/247.
No data: 2017 See https://openreview.net/pdf?id=Sk-oDY9ge.
No data: 2016 See https://papers.nips.cc/paper/5717-taming-the-wild-a-unified-analysis-of-hogwild-style-algorithms.pdf.
No data: 2015 See http://download.tensorflow.org/paper/whitepaper2015.pdf.
In press: fchollet. In press. fchollet/keras. GitHub. See https://github.com/fchollet/keras.
In press: maxpumperla. In press. maxpumperla/elephas. GitHub. See https://github.com/maxpumperla/elephas.
No data: 2014 See https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf.
Need full URL: 2017 Sage Synapse : Contribute to the Cure. See https://www.synapse.org/.
Special characters: Pärnamaa T, Parts L. 2017 Accurate Classification of Protein Subcellular Localization from High-Throughput Microscopy Images Using Deep Learning. G3: Genes|Genomes|Genetics 7, 1385–1392. (doi:10.1534/g3.116.033654)
No data: 2011 See https://ai.stanford.edu/~ang/papers/icml11-MultimodalDeepLearning.pdf.
Author: @BIIntelligence. 2017 IBM edges closer to human speech recognition. Business Insider. See http://www.businessinsider.com/ibm-edges-closer-to-human-speech-recognition-2017-3.
Author and in press: greenelab. In press. greenelab/deep-review. GitHub. See https://github.com/greenelab/deep-review.
No data: 2014 See https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Author: @. 2006 Data is the New Oil. ANA Marketing Maestros. See http://ana.blogs.com/maestros/2006/11/data_is_the_new.html.
Missing metadata: 2017 Weak Supervision. See http://hazyresearch.github.io/snorkel/blog/weak_supervision.html.
No data: 2017 See https://eprint.iacr.org/2017/281.pdf.

If I create JSON files, should I use the canonical citation (e.g. URL) as the id attribute so that you could map it to the citation_id?

Regarding arXiv, would bibtex_passthrough in citations.py be the best place to add the arXiv id to a bibtex field that will be used by the CSL?

dhimmel commented 7 years ago

@agitter we will need to manually override the CSL. For records that you provide JSON for, we can either:

1. Completely replace the CSL data with the new record.
2. Update the CSL data, so you would be able to fix problems without redoing all parts of the record.

What do you think makes sense?
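The difference between the two override strategies above can be sketched in a few lines of Python. The function names and the sample records are made up for illustration; they are not taken from the deep-review build code.

```python
def replace_record(existing, override):
    """Strategy 1: discard the generated record and use the manual one as-is."""
    return dict(override)


def update_record(existing, override):
    """Strategy 2: keep the generated record and overwrite only the given fields."""
    merged = dict(existing)
    merged.update(override)
    return merged


# Illustrative records, not real bibliography entries.
existing = {
    "id": "example",
    "title": "Example title",
    "issued": {"date-parts": [[2004]]},
}
override = {"issued": {"date-parts": [[1998]]}}

replaced = replace_record(existing, override)
merged = update_record(existing, override)

print(sorted(replaced))  # prints ['issued']
print(sorted(merged))    # prints ['id', 'issued', 'title']
```

With full replacement, the manual record must be complete on its own. With updating, a manual record can carry only the corrected fields, but stale fields in the generated record survive unless they are explicitly overwritten.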

@agitter you will produce a CSL JSON file of the correct metadata. Then you will have to add a citation_id field to each record. These should be in the format of the standard_citation column of processed-citations.tsv. The build process will pop this citation_id field and use it to fill the id field with the hash.
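A minimal Python sketch of that pop-and-hash step might look like the following. The actual hash used by the build is not shown in this thread, so `hypothetical_hash` (a truncated SHA-1) is only a stand-in, and the `url:` prefix in the sample `citation_id` is an assumed format for the standard_citation column.

```python
import hashlib


def hypothetical_hash(standard_citation):
    """Stand-in for the build's real hash function (an assumption)."""
    digest = hashlib.sha1(standard_citation.encode("utf-8")).hexdigest()
    return digest[:10]


def assign_ids(manual_records):
    """Pop citation_id from each record and fill id with its hash."""
    processed = []
    for record in manual_records:
        record = dict(record)  # avoid mutating the caller's data
        citation_id = record.pop("citation_id")
        record["id"] = hypothetical_hash(citation_id)
        processed.append(record)
    return processed


manual_records = [
    {
        # Assumed standard_citation format for a URL citation.
        "citation_id": "url:http://www.bioinf.at/publications/2014/NIPS2014a.pdf",
        "type": "article-journal",
        "title": "Deep learning as an opportunity in virtual screening",
        "id": "reference-manager-specific-id",
    }
]

processed = assign_ids(manual_records)
print(processed[0]["id"])
```

Any id exported by a reference manager is overwritten, so it does not matter what the export tool put there.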

In press occurs, I believe, when there is no date. Add a date and in press should go away.
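Assuming that explanation is right, records likely to render as "In press" could be found by checking for a missing or empty `issued` date. The helper below is a hypothetical sketch, and the sample records are trimmed-down illustrations.

```python
def records_missing_date(records):
    """Return the URL (or id) of each CSL item with no usable issued date."""
    flagged = []
    for record in records:
        date_parts = record.get("issued", {}).get("date-parts", [])
        # An absent list, an empty list, or an empty first entry all count as undated.
        if not date_parts or not date_parts[0]:
            flagged.append(record.get("URL", record.get("id")))
    return flagged


records = [
    {"URL": "http://www.cameo3d.org/", "title": "CAMEO"},
    {
        "URL": "http://www.bioinf.at/publications/2014/NIPS2014a.pdf",
        "issued": {"date-parts": [["2014"]]},
    },
]

print(records_missing_date(records))  # prints ['http://www.cameo3d.org/']
```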

Regarding arXiv, would bibtex_passthrough in citations.py be the best place to add the arXiv id to a bibtex field that will be used by the CSL?

This should be done automatically. Let me deal with it.

agitter commented 7 years ago

@dhimmel Completely replacing the CSL data with the new record makes sense to me. I can manually decide whether to copy the existing CSL record and update the broken parts or ignore it and export a record from Zotero. If you agree, I'll get started on a new JSON file. Should I make a pull request for a manual-citations.json to the references branch?

I can match the standard_citation of processed-citations.tsv.

Will I be able to see the HTML or PDF outputs that are built by Travis before we merge? Or will I have to build locally to see if my changes work as expected?

agitter commented 7 years ago

@dhimmel added URLs to all references and we manually edited all major errors in the reference list.

greenelab / deep-review

URLs in references #381