Open hwiorn opened 2 years ago
Hi! It's an interesting idea, definitely in the spirit of Promnesia!
For the backend it should be relatively easy, although it will require some rethinking because currently it mainly targets URLs. But hopefully extracting ISBN/DOI is much easier than extracting URLs and should be a simple regex.
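To make the "simple regex" idea concrete, here is a minimal sketch. It is not Promnesia code: the patterns, the function names, and the `doi:` / `urn:isbn:` output prefixes are all assumptions for illustration. The DOI pattern is a common heuristic rather than a full grammar, and ISBN candidates are checked against the standard ISBN-10/ISBN-13 check digit to cut down on false positives.

```python
import re

# Heuristic (assumed) DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
# ISBN candidates: 10 or 13 digits (ISBN-10 may end in X), hyphens or spaces allowed.
ISBN_RE = re.compile(r'\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b')


def isbn_is_valid(raw: str) -> bool:
    """Check the ISBN-10 or ISBN-13 check digit."""
    chars = [c for c in raw.upper() if c not in "- "]
    if len(chars) == 13 and all(c.isdigit() for c in chars):
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; the sum must be divisible by 10.
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(chars))
        return total % 10 == 0
    if len(chars) == 10 and all(c.isdigit() for c in chars[:9]) and (chars[9].isdigit() or chars[9] == "X"):
        # ISBN-10: weights run 10 down to 1, "X" counts as 10; the sum must be divisible by 11.
        total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(chars))
        return total % 11 == 0
    return False


def extract_identifiers(text: str) -> set[str]:
    """Return identifiers found in free text, using hypothetical doi:/urn:isbn: prefixes."""
    found = set()
    for m in DOI_RE.finditer(text):
        # Trailing punctuation is usually part of the sentence, not of the DOI.
        found.add("doi:" + m.group(0).rstrip(".,;)"))
    for m in ISBN_RE.finditer(text):
        if isbn_is_valid(m.group(0)):
            found.add("urn:isbn:" + re.sub(r"[- ]", "", m.group(0)).upper())
    return found
```

One caveat with this approach: a DOI suffix can itself contain a valid ISBN, so both identifiers may be reported for the same string.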
Possible problems I can think of are mainly on the frontend:
<a>
box around it) to attach the visited marks etc. The DOI would normally just be within the text, so it might require very hacky page modifications. On the other hand, maybe even without attaching any marks to the page, just having reference DOIs in the sidebar would already give people the most benefit. But DOI detection could be opt-in to start with, so I don't find these too concerning :)
Let me know if you want any guidance, there might be some rough edges, especially with all the extension shenanigans.
And by the way you'll be very welcome in https://memex.zulipchat.com/ -- there are spaces there to discuss Promnesia in particular and you might get some input from other people as well (you can login with github -- so won't need to create a new account!)
also related: https://github.com/karlicoss/promnesia/issues/271
It's very quick to query all hyperlinks from the DOM. Not sure what it would take to scrape ISBN/DOI, but hopefully if it's just a regex it should be pretty quick?
Depending on the site, these are very often in a <meta> tag. For example, this book has:
<meta content="9780191776267" property="book:isbn"/>
<meta content="10.1093/actrade/9780192840943.001.0001" name="dc.identifier"/>
An article from ACM similarly has:
<meta name="dc.Identifier" scheme="doi" content="10.1145/953051.801372">
This is typical for journal publishers' sites. It's less convenient if you're looking at other pages, but e.g. Abebooks has <meta itemprop="isbn" content="9781435127739" /> and Amazon has the ASIN scattered all over, including stuff like <input type="hidden" id="ASIN" name="ASIN" value="0385015836">, which shouldn't be hard to get at reliably.
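As a rough sketch of reading these tags without regexing the whole page, the standard-library `html.parser` is enough. The class name, the set of attribute values it looks for (taken only from the examples above), and the `urn:isbn:` / `doi:` / `asin:` output prefixes are assumptions; in particular `asin:` is a made-up scheme.

```python
from html.parser import HTMLParser

# Attribute values (compared lower-cased) taken from the example tags above.
ISBN_KEYS = {"book:isbn", "isbn"}
DOI_KEYS = {"dc.identifier"}


class IdentifierMetaParser(HTMLParser):
    """Collect ISBN/DOI/ASIN hints from <meta> and <input> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.identifiers: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, but not attribute values.
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta":
            keys = {a.get(n, "").lower() for n in ("property", "name", "itemprop")}
            content = a.get("content", "").strip()
            if keys & ISBN_KEYS and content:
                self.identifiers.append("urn:isbn:" + content)
            elif keys & DOI_KEYS and content.startswith("10."):
                self.identifiers.append("doi:" + content)
        elif tag == "input" and a.get("name") == "ASIN" and a.get("value"):
            # "asin:" is a hypothetical prefix, not an established URI scheme.
            self.identifiers.append("asin:" + a["value"])
```

Usage would be `p = IdentifierMetaParser(); p.feed(html)` followed by reading `p.identifiers`. Self-closing tags such as `<meta ... />` also reach `handle_starttag`, so both spellings above are covered.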
For the backend it should be relatively easy, although it will require some rethinking because currently it mainly targets URLs.
However, I do get orig_urls from hypothesis like urn:x-pdf:3719.. that produce norm_urls like x-pdf%3A3719f..., so certainly the world wouldn't end if we stored urn:isbn:0123456789 or doi:10.1234/5678 or even com.github.karlicoss.promnesia:novel-id:1234567 if you want to make up something non-conflicting. The canonifier just needs to emit something sensible given non-URL URIs.
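To illustrate what "something sensible" could look like for non-URL URIs, here is a small sketch. It is not Promnesia's actual canonifier: `canonify_identifier`, the accepted prefixes, and the chosen canonical forms are assumptions. DOIs are lower-cased because DOI names are case-insensitive; ISBNs have hyphens and spaces stripped.

```python
import re
from typing import Optional

# Spellings that are treated as "this is a DOI" (an assumed, non-exhaustive list).
DOI_PREFIXES = (
    "doi:",
    "urn:doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def canonify_identifier(uri: str) -> Optional[str]:
    """Return a canonical doi:/urn:isbn: string, or None for anything else."""
    low = uri.strip().lower()
    for prefix in DOI_PREFIXES:
        if low.startswith(prefix):
            return "doi:" + low[len(prefix):]
    if low.startswith("urn:isbn:"):
        digits = re.sub(r"[- ]", "", low[len("urn:isbn:"):])
        # Upper-case so an ISBN-10 check character is always "X".
        return "urn:isbn:" + digits.upper()
    # Not an identifier this sketch knows about; leave it to the URL canonifier.
    return None
```

Returning `None` for everything else keeps the sketch out of the way of regular URL handling.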
Right -- I guess this is because the URL extractor is on the relaxed side: we'd rather detect some non-URLs than not detect some URLs, since extra broken URLs only result in minor database bloat. So if there is a separate DOI/ISBN extractor and it works, we should be fine without having to mess with URL extractor. Or we could just detect DOIs first and then subtract them from the URL set.
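The "detect DOIs first, then subtract them from the URL set" option could look roughly like this. Both patterns are stand-ins: `URLISH_RE` is a deliberately relaxed toy, not the real URL extractor, and `split_identifiers` is a hypothetical name.

```python
import re

# Heuristic (assumed) DOI pattern.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
# Relaxed stand-in for a URL extractor: also accepts scheme-less "host.tld/path" tokens,
# which is exactly why a bare DOI like 10.1234/5678 gets picked up as a "URL".
URLISH_RE = re.compile(r'(?:https?://)?[\w.-]+\.\w+/[^\s"<>]*')


def split_identifiers(text: str) -> tuple[set[str], set[str]]:
    """Return (dois, urls), where anything recognised as a DOI is removed from urls."""
    dois = {m.group(0).rstrip(".,;)") for m in DOI_RE.finditer(text)}
    urlish = {m.group(0).rstrip(".,;)") for m in URLISH_RE.finditer(text)}
    return dois, urlish - dois
```

The set difference only removes exact matches, so a DOI wrapped in a resolver URL (for example under doi.org) would still count as a URL here.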
Let me know if you want any guidance, there might be some rough edges, especially with all the extension shenanigans.
And by the way you'll be very welcome in https://memex.zulipchat.com/ -- there are spaces there to discuss Promnesia in particular and you might get some input from other people as well (you can login with github -- so won't need to create a new account!)
Actually, I'm already in the memex chat, but I don't have enough time to work on an implementation right now because of work. The web page metadata (tags) I mentioned is what @sopoforic described. I don't think it is a good idea to parse every DOM and HTML page with regexes, since the extractor can easily bloat. But there will always be exceptions, so site-specific parsers (extractors) may sometimes be needed.
However, I do get orig_urls from hypothesis like urn:x-pdf:3719.. that produce norm_urls like x-pdf%3A3719f..., so certainly the world wouldn't end if we stored urn:isbn:0123456789 or doi:10.1234/5678 or even com.github.karlicoss.promnesia:novel-id:1234567 if you want to make up something non-conflicting. The canonifier just needs to emit something sensible given non-URL URIs.
Using urn:isbn:0123456789 or urn:doi:10.1234/5678 is a good idea.
I'm writing an indexer for org-roam and BibTeX to link org-roam to the web browser.
Some org files have citation syntax like below.
The bib file would be like this.
BibTeX entries can have an ISBN, ISSN, DOI, or URL.
The indexer parses the BibTeX files first and links the URL to the ROAM_REFS and CUSTOM_ID of the Org file. I think this works quite well. However, some entries are books which have only an ISBN. I think the Promnesia extension needs to scrape identifiers (ISBN, DOI) from the web page to link it to org-roam files. Book sites except Amazon Kindle provide the ISBN in the Open Graph meta tags of their web pages. But I don't think it is a good idea. It means the Promnesia extension needs some identifier parsers, or the indexer needs to do extra scraping.
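For the BibTeX side, pulling the identifier fields out of an entry might look like the sketch below. It is not the indexer described above: `bib_identifiers` is a hypothetical helper, it only handles simple one-line `field = {value}` or `field = "value"` assignments, and a real indexer would more likely use a proper BibTeX parser.

```python
import re

# Matches one-line assignments for the four identifier-like fields mentioned above.
FIELD_RE = re.compile(
    r'^\s*(isbn|issn|doi|url)\s*=\s*[{"](.*?)[}"]\s*,?\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def bib_identifiers(entry: str) -> dict[str, str]:
    """Map lower-cased field name to value for ISBN/ISSN/DOI/URL fields in one entry."""
    return {name.lower(): value.strip() for name, value in FIELD_RE.findall(entry)}
```

An entry that yields only an `isbn` key is the case described above: there is no URL to link, so the match has to come from an identifier found on the page.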
Can I add identifier scraping for web pages to Promnesia? Would that be a good idea?