phiresky / pandoc-url2cite

Effortlessly and transparently add correctly styled citations to your markdown paper given only a URL

feature_request(behavior): generation in Markdown directly #7

Closed Kristinita closed 3 years ago

Kristinita commented 3 years ago

1. Summary

It would be nice if citations could be generated from links directly in Markdown.

2. Example

Like your blogpost about pandoc-url2cite.

2.1. Input

Kira first[^1], Kira second[^2]

[^1]: https://papers.nips.cc/paper/5423-generative-adversarial-nets
[^2]: https://doi.org/10.5325/utopianstudies.28.3.0685

2.2. Output

Kira first[^1], Kira second[^2]

[^1]: **Supervised Learning of Probability Distributions by Neural Networks**
Eric Baum, Frank Wilczek *Neural Information Processing Systems* (1988) <https://proceedings.neurips.cc/paper/1987/file/eccbc87e4b5ce2fe28308fd9f2a7baf3-Paper.pdf>
[^2]: **Utopia for Realists: And How We Can Get There by Rutger Bregman (review)**
Bill Metcalf *Utopian Studies* (2017) <https://muse.jhu.edu/article/686666>

3. Argumentation

3.1. Productivity

Scientists and researchers should be doing science, not spending a lot of time on citations. Routine work that can be automated should be simplified as much as possible. Ideally, a single CLI command would transform all links.

3.2. Journals requirements

I need to send articles to Russian scientific journals in .doc or .docx format, not .pdf.

I use Markdown for writing any content. I think this is the most convenient format available, and it’s popular. It might be better to generate the bibliographic information directly in Markdown and then, if needed, convert the valid Markdown to other formats such as .pdf or .doc with third-party tools.

3.3. Wide application

High-quality bibliographic information may be needed for more than scientific articles. Why shouldn’t we do this in any of our blog posts or other places that use Markdown? Why should the articles on our personal sites be of lower quality?

Thanks.

phiresky commented 3 years ago

You can output the file in any format supported in pandoc. For example, the following work great:

I have added some output formats to the examples/ directory. For my blog article, I used my own code that takes the pandoc JSON output and renders it directly to React components, to enable interactivity with the citations and other things. But for most websites, I think plain hyperlinks work pretty well.

Markdown output also works; you need to add --to commonmark-raw_html (or --to markdown-citations to get pandoc-flavored Markdown output, or --to commonmark to get mixed Markdown and HTML).

Looks like this on my examples/minimal.md:

# Introduction

The GAN was first introduced in \[1\].

# References

\[1\] I. Goodfellow *et al.*, “Generative Adversarial Nets,” in
*Advances in Neural Information Processing Systems 27*, Z. Ghahramani,
M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran
Associates, Inc., 2014, pp. 2672–2680 \[Online\]. Available:
<http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf>.
\[Accessed: 14-Dec-2019\]

\[2\] D. Kahneman, *Thinking, fast and slow*, 1st ed. New York, 2011
\[Online\]. Available: <https://www.worldcat.org/oclc/706020998>.
\[Accessed: 14-Dec-2019\]

\[3\] K. Muldoon, J. Towse, V. Simms, O. Perra, and V. Menzies, “A
longitudinal analysis of estimation, counting skills, and mathematical
ability across the first school year.” *Developmental Psychology*, vol.
49, no. 2, pp. 250–257, 2013 \[Online\]. Available:
<http://doi.apa.org/getdoi.cfm?doi=10.1037/a0028240>. \[Accessed:
14-Dec-2019\]

rendered:

Introduction

The GAN was first introduced in [1].

References

[1] I. Goodfellow et al., “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680 [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf. [Accessed: 14-Dec-2019]

[2] D. Kahneman, Thinking, fast and slow, 1st ed. New York, 2011 [Online]. Available: https://www.worldcat.org/oclc/706020998. [Accessed: 14-Dec-2019]

[3] K. Muldoon, J. Towse, V. Simms, O. Perra, and V. Menzies, “A longitudinal analysis of estimation, counting skills, and mathematical ability across the first school year.” Developmental Psychology, vol. 49, no. 2, pp. 250–257, 2013 [Online]. Available: http://doi.apa.org/getdoi.cfm?doi=10.1037/a0028240. [Accessed: 14-Dec-2019]
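
The Markdown output options mentioned in this comment could be wired into a small script along the following lines. This is only a sketch: it assumes that pandoc-url2cite is installed and on the PATH, and that the pandoc version is recent enough to accept --citeproc (older versions used --filter pandoc-citeproc instead). The function names build_pandoc_command and convert are hypothetical helpers, not part of either tool.

```python
import subprocess
from typing import List


def build_pandoc_command(src: str, dst: str,
                         to: str = "commonmark-raw_html") -> List[str]:
    """Build a pandoc argv that runs the url2cite filter and writes Markdown.

    `to` can be any of the writer names from the comment above, e.g.
    "commonmark-raw_html", "markdown-citations" or "commonmark".
    """
    return [
        "pandoc",
        "--filter", "pandoc-url2cite",  # assumed to be on the PATH
        "--citeproc",                   # assumes a pandoc with built-in citeproc
        "--to", to,
        "--output", dst,
        src,
    ]


def convert(src: str, dst: str, to: str = "commonmark-raw_html") -> None:
    """Run pandoc; raises CalledProcessError if pandoc exits with an error."""
    subprocess.run(build_pandoc_command(src, dst, to), check=True)
```

Calling convert("examples/minimal.md", "minimal.out.md") would then write the converted file. Note that the whole document passes through pandoc, so the output is regenerated Markdown rather than the original text with only the references changed.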

Kristinita commented 3 years ago

Status: Not fixed :crying_cat_face:

1. Problem

Can you show an example of a CLI command that modifies nothing in real Markdown files except the references? I haven’t been able to find one.

Also, I couldn’t find out how to apply only the pandoc filter without transforming everything else.

2. Attempts

Markdown output also works, you need to add --to commonmark-raw_html. (Or --to markdown-citations to get pandoc-flavored markdown output, or --to commonmark to get mixed markdown and html).

I've tried all of these, as well as --to markdown, --to markdown_strict and --to markdown_mmd.

3. Expected behavior

Every command I tried heavily modifies real Markdown files. All that is needed is to replace each reference. For example, this line:

[^1]: https://papers.nips.cc/paper/5423-generative-adversarial-nets

should become:

[^1]: **Supervised Learning of Probability Distributions by Neural Networks**
Eric Baum, Frank Wilczek *Neural Information Processing Systems* (1988) <https://proceedings.neurips.cc/paper/1987/file/eccbc87e4b5ce2fe28308fd9f2a7baf3-Paper.pdf>

Nothing else in the Markdown files should be touched.

4. Argumentation

4.1. Common cause

Compatibility.

4.2. Details

We use different tools to convert Markdown to HTML. I use Pelican; other people may use other tools. Most of my code is valid Markdown, but I also use specific extensions such as PyMdown Extensions. Pandoc transforms my Markdown, and Pelican then compiles the result into unwanted HTML.

5. Regex suggestion

This is not an ideal suggestion, but it’s better than nothing.

We can find the links with the regular expression:

(?!^\[\^\d+\]: )(https?:\/\/[^\s]+)$

Demonstration on Regex101.

We can replace the found links with citations formatted in the selected CSL style. This won’t affect anything else in the Markdown files.
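
A minimal Python sketch of this regex idea follows. It is an illustration under assumptions, not an existing tool: the pattern captures the footnote prefix in a group instead of using a lookaround (so only lines of the form [^N]: URL are touched), and format_citation is a hypothetical callback that would have to turn a URL into a formatted citation by some other means.

```python
import re
from typing import Callable

# A footnote definition whose whole body is a single URL, e.g.
# "[^1]: https://example.org/paper". Group 1 is the prefix, group 2 the URL.
FOOTNOTE_URL = re.compile(r"^(\[\^\d+\]: )(https?://\S+)[ \t]*$", re.MULTILINE)


def replace_footnote_urls(markdown: str,
                          format_citation: Callable[[str], str]) -> str:
    """Replace URL-only footnote definitions; leave all other text untouched."""
    def _substitute(match: "re.Match[str]") -> str:
        prefix, url = match.group(1), match.group(2)
        # format_citation is hypothetical: it maps a URL to citation text.
        return prefix + format_citation(url)

    return FOOTNOTE_URL.sub(_substitute, markdown)
```

Because the substitution only rewrites the matched lines, headings, lists and extension syntax elsewhere in the file stay byte-for-byte identical, which is the compatibility property asked for above.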

Thanks.

phiresky commented 3 years ago

Revisiting this: I think this is out of scope for this project. pandoc-url2cite is specifically a pandoc filter, and as such it transforms the whole document through the pandoc AST, which can reproduce the whole semantics of the Markdown but not the detailed formatting, such as which type of headings are used or the exact indentation of lists.

The main "magic" of this tool relies on the Zotero Extractors: they do the part of converting a URL to a BibTeX citation object. The citation style generator within pandoc then generates the actual formatted citation. url2cite sits in between and so cannot (easily) affect pandoc's citation output generation; even outputting it as [^1]: Supervised Learning of Probability... isn't possible.

A tool that does what you want should be fairly easy to implement, though, using just the Zotero Extractors. If you want to preserve the exact existing formatting of your Markdown files, you probably need to use whatever parser you already use to create your AST, since I don't know of any parser that preserves the exact formatting. That said, I'd suggest letting go of very specific, non-semantic formatting in your Markdown files anyway and just always passing them through a formatter like prettier.

It would also be possible (and pretty easy) to extract part of pandoc-url2cite into a standalone url2cite library that takes a link or a list of links and converts them to CSL JSON objects, which could then be used in arbitrary pipelines. That would basically just be a curl https://en.wikipedia.org/api/rest_v1/#/Citation/getCitation ... though.
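
As a rough illustration of that last point, a standalone lookup might look like the sketch below. It assumes that the citation service behind the linked API documentation is reachable at the path /data/citation/{format}/{query} and that "bibtex" is one of the accepted formats; both should be checked against the documentation. The names citation_endpoint and fetch_citation are hypothetical.

```python
import urllib.parse
import urllib.request

# Assumed base path of the Wikimedia REST citation endpoint.
CITATION_BASE = "https://en.wikipedia.org/api/rest_v1/data/citation"


def citation_endpoint(url: str, fmt: str = "bibtex") -> str:
    """Build the request URL; the cited URL is fully percent-encoded."""
    return f"{CITATION_BASE}/{fmt}/{urllib.parse.quote(url, safe='')}"


def fetch_citation(url: str, fmt: str = "bibtex", timeout: float = 30.0) -> str:
    """Fetch the raw citation record for `url` (performs a network request)."""
    request = urllib.request.Request(
        citation_endpoint(url, fmt),
        headers={"User-Agent": "url2cite-sketch/0.1"},  # placeholder identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

The returned record would still need to be run through a citation style processor before it could replace a footnote, which is the step pandoc performs inside the filter pipeline.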