karlicoss / promnesia

Another piece of your extended mind
https://beepb00p.xyz/promnesia.html

Can I search ISSN, ISBN, and DOI in a web page, not only URLs? #280

Open hwiorn opened 2 years ago

hwiorn commented 2 years ago

I'm writing an indexer for org-roam and BibTeX to link org-roam notes to the web browser.

Some org files have citation syntax like the one below.

:PROPERTIES:
:ID:       120cf393-9ec3-40b8-a486-d903036236f8
:ROAM_REFS: cite:Dong2018
:END:
#+TITLE: Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition
#+CREATED: [2021-07-18 Sun 15:38]
#+filetags: :Literature:

- tags ::
- keywords ::
- author(s) :: Dong, Linhao and Xu, Shuang and Xu, Bo

The corresponding bib file looks like this.

@InProceedings{Dong2018,
  author     = {Dong, Linhao and Xu, Shuang and Xu, Bo},
  booktitle  = {2018 {IEEE} {International} {Conference} on {Acoustics}, {Speech} and {Signal} {Processing} ({ICASSP})},
  title      = {Speech-{Transformer}: {A} {No}-{Recurrence} {Sequence}-to-{Sequence} {Model} for {Speech} {Recognition}},
  year       = {2018},
  month      = apr,
  note       = {ZSCC: 0000311 ISSN: 2379-190X},
  pages      = {5884--5888},
  abstract   = {Recurrent sequence-to-sequence models using encoder-decoder architecture have made great progress in speech recognition task. However, they suffer from the drawback of slow training speed because the internal recurrence limits the training parallelization. In this paper, we present the Speech-Transformer, a no-recurrence sequence-to-sequence model entirely relies on attention mechanisms to learn the positional dependencies, which can be trained faster with more efficiency. We also propose a 2D-Attention mechanism, which can jointly attend to the time and frequency axes of the 2-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer. Evaluated on the Wall Street Journal (WSJ) speech recognition dataset, our best model achieves competitive word error rate (WER) of 10.9\%, while the whole training process only takes 1.2 days on 1 GPU, significantly faster than the published results of recurrent sequence-to-sequence models.},
  doi        = {10.1109/ICASSP.2018.8462506},
  file       = {:Dong2018 - Speech Transformer_ a No Recurrence Sequence to Sequence Model for Speech Recognition.html:URL;:dong2018.pdf:PDF},
  issn       = {2379-190X},
  keywords   = {Hidden Markov models, Encoding, Training, Decoding, Speech recognition, Time-frequency analysis, Spectrogram, Speech Recognition, Sequence-to-Sequence, Attention, Transformer},
  shorttitle = {Speech-{Transformer}},
}

A BibTeX entry can have an ISBN, ISSN, DOI, or URL.

The indexer parses the BibTeX files first and links the URL to the ROAM_REFS and CUSTOM_ID properties of the org file. I think this works quite well.
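For illustration, a minimal sketch of that mapping step (deliberately naive and regex-based; bib_to_refs, the field patterns, and the library.bib path are made up for the example -- a real implementation would use a proper BibTeX parser):

import re
from pathlib import Path

# Maps BibTeX cite keys to the best available identifier (URL > DOI > ISBN),
# so a ROAM_REFS value like "cite:Dong2018" can be resolved to something
# the browser extension can match against.
ENTRY_RE = re.compile(r'@\w+\{(?P<key>[^,\s]+),')
FIELD_RE = re.compile(r'^\s*(?P<name>\w+)\s*=\s*\{(?P<value>[^{}]*)\},?\s*$', re.M)

def bib_to_refs(bib_path: Path) -> dict[str, str]:
    refs: dict[str, str] = {}
    # Naive split on '@'; a real parser (e.g. bibtexparser) is more robust.
    for chunk in bib_path.read_text().split('@')[1:]:
        entry = '@' + chunk
        m = ENTRY_RE.match(entry)
        if m is None:
            continue
        fields = {f['name'].lower(): f['value'] for f in FIELD_RE.finditer(entry)}
        for name, prefix in (('url', ''), ('doi', 'doi:'), ('isbn', 'urn:isbn:')):
            if name in fields:
                value = fields[name]
                if name == 'isbn':
                    value = value.replace('-', '')  # ISBNs are often hyphenated
                refs[m['key']] = prefix + value
                break
    return refs

# For the entry above: bib_to_refs(Path('library.bib'))['Dong2018']
# -> 'doi:10.1109/ICASSP.2018.8462506'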

However, some entries are books that only have an ISBN. I think the Promnesia extension would need to scrape identifiers (ISBN, DOI) from the web page to link it to org-roam files. Most book sites (Amazon Kindle excepted) provide the ISBN in the Open Graph meta tags of their pages. But I'm not sure this is a good idea: it means the Promnesia extension needs identifier parsers, or the indexer needs to do extra scraping.

Could identifier scraping from web pages be added to Promnesia? Would that be a good idea?

karlicoss commented 2 years ago

Hi! It's an interesting idea, definitely in the spirit of Promnesia!

For the backend it should be relatively easy, although it will require some rethinking because it currently mainly targets URLs. But hopefully extracting an ISBN/DOI is much easier than extracting a URL and should be a simple regex.
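For illustration, something along these lines (the DOI pattern is the widely used Crossref-style regex; the ISBN pattern is deliberately loose and would want check-digit validation on top):

import re

# Matches the vast majority of modern DOIs (Crossref-style pattern).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')

# Loose ISBN-10/13 shape; real code should verify the check digit,
# otherwise phone numbers and similar digit runs will slip through.
ISBN_RE = re.compile(r'\b(?:97[89][- ]?)?\d{1,5}[- ]?\d{1,7}[- ]?\d{1,7}[- ]?[\dXx]\b')

def extract_identifiers(text: str) -> set[str]:
    ids = {'doi:' + m.group() for m in DOI_RE.finditer(text)}
    ids |= {'urn:isbn:' + re.sub(r'[- ]', '', m.group())
            for m in ISBN_RE.finditer(text)}
    return ids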

Possible problems I can think of are mainly on the frontend:

- it's very quick to query all hyperlinks from the DOM. Not sure what it would take to scrape ISBN/DOI, but hopefully if it's just a regex it should be pretty quick?

But DOI detection could be opt-in to start with, so I don't find these too concerning :)

Let me know if you want any guidance; there might be some rough edges, especially with all the extension shenanigans.

And by the way, you'd be very welcome at https://memex.zulipchat.com/ -- there are spaces there to discuss Promnesia in particular, and you might get some input from other people as well (you can log in with GitHub, so you won't need to create a new account!)

also related: https://github.com/karlicoss/promnesia/issues/271

sopoforic commented 2 years ago

> it's very quick to query all hyperlinks from the DOM. Not sure what it would take to scrape ISBN/DOI, but hopefully if it's just a regex it should be pretty quick?

Depending on the site, these are very often in a <meta> tag. For example, this book has:

<meta content="9780191776267" property="book:isbn"/>
<meta content="10.1093/actrade/9780192840943.001.0001" name="dc.identifier"/>

An article from ACM similarly has:

<meta name="dc.Identifier" scheme="doi" content="10.1145/953051.801372">

This is typical for journal publishers' sites. It's less convenient if you're looking at other pages, but e.g. AbeBooks has <meta itemprop="isbn" content="9781435127739" /> and Amazon has the ASIN scattered all over, including things like <input type="hidden" id="ASIN" name="ASIN" value="0385015836">, which shouldn't be hard to get at reliably.
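A rough backend-side sketch of that scraping, using only the attribute patterns shown above (the browser extension would do the equivalent with document.querySelectorAll; note that dc.identifier can carry schemes other than DOI, so a real version should also check the scheme attribute):

from html.parser import HTMLParser

# Meta attributes observed above; far from exhaustive.
META_KEYS = {'book:isbn': 'isbn', 'dc.identifier': 'doi', 'isbn': 'isbn'}

class MetaIdentifierParser(HTMLParser):
    """Collects ISBN/DOI-ish values from <meta> tags plus Amazon's ASIN input."""
    def __init__(self) -> None:
        super().__init__()
        self.found: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'meta':
            key = (a.get('property') or a.get('name') or a.get('itemprop') or '').lower()
            if key in META_KEYS and a.get('content'):
                self.found[META_KEYS[key]] = a['content']
        elif tag == 'input' and a.get('id') == 'ASIN' and a.get('value'):
            self.found['asin'] = a['value']

def extract_meta_ids(html: str) -> dict[str, str]:
    p = MetaIdentifierParser()
    p.feed(html)
    return p.found

# extract_meta_ids('<meta content="9780191776267" property="book:isbn"/>')
# -> {'isbn': '9780191776267'}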

sopoforic commented 2 years ago

> For the backend it should be relatively easy, although it will require some rethinking because it currently mainly targets URLs.

However, I do get orig_urls from Hypothesis like urn:x-pdf:3719.. that produce norm_urls like x-pdf%3A3719f..., so certainly the world wouldn't end if we stored urn:isbn:0123456789 or doi:10.1234/5678, or even com.github.karlicoss.promnesia:novel-id:1234567 if you want to make up something non-conflicting. The canonifier just needs to emit something sensible given non-URL URIs.
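That is, the canonifier could short-circuit URN-like URIs before the usual URL normalization. A minimal sketch, assuming a pass-through approach (canonify_url below is a stand-in stub, not Promnesia's actual function):

import re
from urllib.parse import urlsplit

# URI schemes to store verbatim instead of URL-normalizing.
NON_URL_URI = re.compile(r'^(?:urn:[a-z0-9][a-z0-9.-]*:|doi:)', re.I)

def canonify_url(url: str) -> str:
    # Stand-in for the existing URL canonifier (scheme/www stripping etc.).
    parts = urlsplit(url if '://' in url else '//' + url)
    host = (parts.netloc or '').lower().removeprefix('www.')
    return host + parts.path.rstrip('/')

def canonify(uri: str) -> str:
    if NON_URL_URI.match(uri):
        # urn:isbn:0123456789, urn:x-pdf:3719..., doi:10.1234/5678 pass
        # straight through (DOIs are case-insensitive, so lowercasing is safe).
        return uri.lower()
    return canonify_url(uri)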

karlicoss commented 2 years ago

Right -- I guess this is because the URL extractor is on the relaxed side: we'd rather detect some non-URLs than miss some real URLs, since extra broken URLs only result in minor database bloat. So if there is a separate DOI/ISBN extractor and it works, we should be fine without having to mess with the URL extractor. Or we could just detect DOIs first and then subtract them from the URL set.
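A sketch of that subtraction, reusing a DOI pattern like the one above (URL_RE here is intentionally relaxed, mirroring the over-detecting behaviour just described):

import re

DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')
# Relaxed on purpose: over-detecting URLs is cheap, as noted above.
URL_RE = re.compile(r'\bhttps?://\S+|\b[\w-]+(?:\.[\w-]+)+/\S*', re.I)

def extract(text: str) -> tuple[set[str], set[str]]:
    doi_spans = [m.span() for m in DOI_RE.finditer(text)]
    dois = {'doi:' + text[a:b] for a, b in doi_spans}
    # Drop URL candidates that overlap a detected DOI, so a bare DOI (or a
    # https://doi.org/... link) ends up stored once, as a DOI.
    urls = {m.group() for m in URL_RE.finditer(text)
            if not any(a < m.end() and m.start() < b for a, b in doi_spans)}
    return dois, urls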

hwiorn commented 2 years ago

> Let me know if you want any guidance; there might be some rough edges, especially with all the extension shenanigans.

> And by the way, you'd be very welcome at https://memex.zulipchat.com/ -- there are spaces there to discuss Promnesia in particular, and you might get some input from other people as well (you can log in with GitHub, so you won't need to create a new account!)

Actually, I'm already in the memex chat, but I don't have enough time to work on an implementation right now because of work. The web-page metadata (tags) I mentioned are the same ones @sopoforic described. I don't think it's a good idea to parse the whole DOM/HTML with regexes, since the extractor could easily bloat. But there will always be exceptions, so specific parsers (extractors) may be needed sometimes.

> However, I do get orig_urls from Hypothesis like urn:x-pdf:3719.. that produce norm_urls like x-pdf%3A3719f..., so certainly the world wouldn't end if we stored urn:isbn:0123456789 or doi:10.1234/5678, or even com.github.karlicoss.promnesia:novel-id:1234567 if you want to make up something non-conflicting. The canonifier just needs to emit something sensible given non-URL URIs.

Using urn:isbn:0123456789 or urn:doi:10.1234/5678 is a good idea.