rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library
http://biostor.org

Making Myriatrix work for BioStor #100

Open Archilegt opened 2 years ago

Archilegt commented 2 years ago

This would be a development rather than an issue.

Questions: Is it feasible to use the Myriatrix biblio database as a source of references for BioStor? If not, why not? If "yes, but...", what is missing or needs to be solved?

rdmpage commented 2 years ago

Yes it could, although:

Archilegt commented 2 years ago

@rdmpage, thanks for the feedback.

We have to find a way to blend sources for the BoL, and work with both taxonomic databases and journal indexes. BIOfid could take the path of fully indexing one German or German-language journal at a time, but 1) this has to be discussed and 2) those references should still aggregate into a larger database that can be community-curated and has some degree of authority control. Such a database should also be able to aggregate references coming from taxonomic communities, and should be able to serve them back for citing and building reference lists while writing taxonomic papers. That way the communities will get their recording effort back.

We have been talking about the BoL for the past 12 years, and we still don't have a global database where we can log in as experts and add and curate references that can be reused by biologists, and especially taxonomists, in the way we need. That database is not a Wikipedia or Wikidata. That database would be an expert-curated BoL Scratchpad that 1) allows direct input, 2) aggregates references from all Scratchpads, and 3) serves the references back to everyone, e.g., other Scratchpads which need the same references, users using Publication templates, the BHL, BioStor, BIOfid, Wikidata, and so on.

I propose that we move on with creating a Bibliography of Life Scratchpad as soon as a decision is made on which Drupal version (8, 9, or 10) will be used to deploy Scratchpads 3.0.

rdmpage commented 2 years ago

@Archilegt

It looks like I can parse the BibTeX output which is more detailed. Not my favourite format, but usable.

Re BoL I think Wikidata is the obvious platform for this, simply because of the size of the community and the richness of the data model. You don't want just a list of references, you want links to identifiers, translations, versions, citations, etc. and the ability to provide sources for data (e.g., dates). It seems crazy to attempt to replicate this in yet another database.

There is the issue that Wikidata itself is not the friendliest for searching, but I think we just need to put a nice interface over the top of it. This is the motivation for experiments like http://alec-demo.herokuapp.com and https://wikicite-search.herokuapp.com

Archilegt commented 2 years ago

As much as I think that you are right, and as much as I myself think that Wikidata is the future, I keep looking at the present and at the social aspect of bibliographic reference recording and data curation. I do not believe that we can engage taxonomic communities in becoming Wikipedians, but I do believe that we can engage taxcoms in becoming Scratchpadians. The Wiki world is not well accepted because it is open to everyone, while experts want to feel in control and discuss among their peers. If we do not provide an expert platform, we will lose their input.

Again, this is not just about searching; this is about reusing references for writing publications. That was one of the core ideas since at least 2011 and remained so in 2013 and beyond. We have to give the communities their time input back. Scratchpads has a publication template that can be diversified, and it has a cite button that generates a citation and the corresponding bibliography within a manuscript.

Also, the multiple uses of recorded references in a Scratchpad are an additional motivation to record them in the first place. There are at least five different uses for publications in Scratchpads, e.g., 1) the reference itself, 2) populating a scientific name with its original source, 3) creating a bibliography for a given name via reference taxon-tagging, 4) citing and automatically creating bibliographies in text added to Taxon Description content via the [bib] tag, and 5) citing and automatically creating bibliographies in manuscripts via the cite button. See biblio example for 1), which is then reused for the scientific name Dendrothereua linceci as in 2). A bibliography for D. linceci as in 3) is also provided, additionally displaying the API results of ReFindIt and BHL. See example for 4) in the Taxon Description page of Dendrothereua linceci, section Distribution. The underlying Species Profile Model is now the Darwin Core Vocabulary Description Type GBIF Vocabulary. The vocabulary and resulting DwC-A are flat and do not take the reference list with them into GBIF, but the text and the [bib] tags with the unique internal number are exported and could be reattached by developing the DwC-A further.

As long as researchers are mostly and regrettably measured by their numerical publication output and impact factors instead of by data recording, structuring, and curation, tools that directly reward their efforts will be preferred over tools that don't. Looking at the present, Scratchpads is more relevant than Wikidata, and if the content of a BoL Scratchpad can be exported to Wikidata, nothing is lost and a lot would be won.

About data linking, it is completely possible to create a publication star-schema that can be imported by Wikidata, or to reuse any that exists. It is all about mapping. About sources for publication dates, I record them by hand for each publication while maintaining this list. If we are to have a BoL Scratchpad, of course we can build a content type where we record publication date sources. We can tailor it to our needs, and much can be done without coding because of the flexibility of the Drupal CMS. What we need are use cases and examples, and work on getting it done. My list and the list by Neal Evenhuis can be used as the core for development.
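
To make the "it is all about mapping" point concrete, here is a minimal sketch of what such a mapping might look like. The field names on the left are hypothetical (they simply mirror common BibTeX keys) and are not the actual Scratchpads or Biblio schema; the property IDs on the right are existing Wikidata properties. The sample record is the Porat (1876) reference quoted later in this thread. This only pairs values with properties; it does not create items or talk to the Wikidata API.

```python
# Hypothetical reference field names mapped to Wikidata property IDs.
FIELD_TO_PROPERTY = {
    "title": "P1476",    # title
    "author": "P2093",   # author name string
    "journal": "P1433",  # published in (Wikidata expects an item here, not a string)
    "volume": "P478",    # volume
    "issue": "P433",     # issue
    "pages": "P304",     # page(s)
    "year": "P577",      # publication date
    "doi": "P356",       # DOI
    "url": "P953",       # full work available at URL
}


def to_statements(reference):
    """Return (property_id, value) pairs for every mapped, non-empty field."""
    statements = []
    for field, value in reference.items():
        prop = FIELD_TO_PROPERTY.get(field)
        if prop and value:
            statements.append((prop, value))
    return statements


# Sample record, copied from the BibTeX entry quoted in this thread.
porat_1876 = {
    "citekey": "1616",  # unmapped in this sketch, so it is skipped
    "title": "Om några exotiska Myriopoder",
    "author": "Porat, C. O. v.",
    "journal": "Bihang till Kongl. Svenska vetenskaps-akademiens handlingar",
    "volume": "4",
    "year": "1876",
    "pages": "1-48",
    "url": "https://www.biodiversitylibrary.org/page/14206168",
}

for prop, value in to_statements(porat_1876):
    print(prop, value)
```

A real export would also need to resolve journals and authors to Wikidata items and attach sources for dates, which is exactly the part that needs use cases and examples.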

Archilegt commented 2 years ago

Actually, issue "Pagination missing in RIS export files" was opened by me in the Scratchpads GitHub on 3 January 2022. I will assist Rob Davies in getting it fixed. @rdmpage, I will let you know if we succeed.

rdmpage commented 2 years ago

@Archilegt I've written some code to handle BibTeX, so I am happy to use that. The RIS export lacks DOIs. I also get different numbers of references depending on whether I download BibTeX or RIS, so I'm not quite sure what's going on.

Archilegt commented 2 years ago

@rdmpage, your new code is great news! Let's try getting you all the refs! On my dashboard, I see 1309 published refs. Which of the files gets closer to that number?

image

About the DOI, I reported the issue for RIS imports (but not for exports until now) more than two years ago, on 1 March 2020. See: "Please add RIS field identifier DO". It may be a mapping issue, and fixing it may solve both the RIS import and export.

rdmpage commented 2 years ago

@Archilegt I spoke too soon. As far as I can tell Drupal's bibliographic export sucks. The files I'm downloading have lots of duplicate references 👎

Archilegt commented 2 years ago

@rdmpage, I am sorry to hear that, and also surprised. Until now I have believed duplicates to be rare. Could it be that what looks like duplicate references is me using the field "Abstract" to enter a "Recommended citation", which is the reference rewritten as a text string? E.g.,

@article {1616,
    title = {Om n{\r a}gra exotiska Myriopoder},
    journal = {Bihang till Kongl. Svenska vetenskaps-akademiens handlingar},
    volume = {4},
    year = {1876},
    pages = {1-48},
    abstract = {<p><strong>Recommended citation:</strong> Porat, C. O. v. (1876): Om n\&aring;gra exotiska Myriopoder. <em>Bihang till Kungliga Svenska Vetenskaps-Akademiens Handlingar</em>, 4 (7): 1-48. https://www.biodiversitylibrary.org/page/14206168</p>
},
    url = {https://www.biodiversitylibrary.org/page/14206168},
    author = {Porat, C. O. v.}
}

rdmpage commented 2 years ago

@Archilegt No, I think the problem is with Drupal. If I go to any Scratchpad, click on the "LITERATURE" tab, and then click on "Export selection" on the left (to get the whole bibliography), I get duplicates. There is something deeply broken here.

Archilegt commented 2 years ago

@rdmpage ... the horror! :-( Maybe related to "Citation Key values always even numbers", though that seems not to be the case anymore.

Archilegt commented 2 years ago

@rdmpage, what about here: https://aba.myspecies.info/biblio The size of that literature database is more manageable. Do you detect duplicates in the RIS? I searched for "Australasian" and I got two ref hits, which is correct, but I didn't attempt finding duplicates in the whole RIS file.

rdmpage commented 2 years ago

If you open it in a text editor you'll see massive duplication. Probably the simplest way is to sort the lines and look at those lines starting with "TI - ".
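
The same check can be scripted. The following is a minimal sketch, not the code used for BioStor: it assumes the export is plain-text RIS with one tag per line, treats both `TI` and `T1` as title tags (an assumption, since exporters differ), and the function name and sample data are made up for illustration.

```python
import re
from collections import Counter

# Matches title lines such as "TI  - Some title" or "T1 - Some title".
TITLE_LINE = re.compile(r"^(?:TI|T1)\s+-\s*(.*)$")


def duplicated_titles(ris_text):
    """Return {title: count} for every title that occurs more than once."""
    titles = []
    for line in ris_text.splitlines():
        match = TITLE_LINE.match(line.strip())
        if match:
            titles.append(match.group(1).strip())
    return {title: n for title, n in Counter(titles).items() if n > 1}


# Tiny made-up export: the first record appears twice.
sample_ris = "\n".join([
    "TY  - JOUR",
    "TI  - First title",
    "ER  - ",
    "TY  - JOUR",
    "TI  - Second title",
    "ER  - ",
    "TY  - JOUR",
    "TI  - First title",
    "ER  - ",
])

print(duplicated_titles(sample_ris))
```

Matching on titles alone can flag distinct papers that share a title, so a non-empty result is a prompt to look closer rather than proof of duplication.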

Archilegt commented 2 years ago

I finished installing Zotero and uploading a RIS file. Remarks:

  1. Zotero imported 1300 references with the file, which does not match the 1309 references previously shown in the automatic reference count of my Myriatrix Scratchpad. Cause of mismatch: unknown. It may be that recent references need a back-end database update to get into the actual biblio exports. Documentation on the frequency of such updates is required.
  2. Total mess. I don't even know what I'm looking at. It seems like each reference got split as many times as it had RIS fields, and titles were annotated with the links to BHL landing pages.

image

  3. Aaaaaaaaaaarrrrrrggggghhhhh!
  4. With the tool "Duplicate items" and its option "Generate Report from Items", it is possible to see that the entries are full duplicates and not fragmented references.

image

  5. There seems to be some sort of regularity. Zotero counts 26 duplicates per reference and offers to merge them.

image

  6. After deduplication, only 50 references remain, which means that, at least for Myriatrix, most references actually don't get exported at all.

image

  7. As in 3 above.
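
The Zotero steps above can be approximated without a reference manager. This is a minimal sketch under stated assumptions: records are taken to end at an `ER` line, two records count as duplicates only if all their lines are identical, and the function names and sample data are hypothetical.

```python
import re
from collections import Counter

END_OF_RECORD = re.compile(r"^ER\s+-")


def split_records(ris_text):
    """Split RIS text into records, each returned as a tuple of its lines."""
    records, current = [], []
    for line in ris_text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if END_OF_RECORD.match(line):
            if current:
                records.append(tuple(current))
            current = []
        else:
            current.append(line)
    if current:  # tolerate a final record that lacks an ER line
        records.append(tuple(current))
    return records


def deduplicate(records):
    """Return (unique records in first-seen order, copies per record)."""
    counts = Counter(records)
    return list(counts), counts


# Made-up export: record A is exported three times, record B once.
record_a = ["TY  - JOUR", "TI  - First title", "ER  - "]
record_b = ["TY  - JOUR", "TI  - Second title", "ER  - "]
sample_export = "\n".join(record_a * 3 + record_b)

records = split_records(sample_export)
unique, counts = deduplicate(records)
print(len(records), "records,", len(unique), "unique")
```

Read against the numbers above, and assuming "26 duplicates" means 26 copies of each reference, 1300 imported records divided by 26 gives the 50 unique references that remained after merging.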

Archilegt commented 1 year ago

Hi, @rdmpage. Please see "Update biblio export modules" and "Exporting bibliography (format: BibTeX) problem". It may be that the Myriatrix Literature RIS export is working now. There should be 1500 references or so.

Archilegt commented 1 year ago

No, it's not yet working for Myriatrix. Same dumb replication error: 50 references as before, each replicated 30 times, which equals 1500.

rdmpage commented 1 year ago

I guess I'll wait until you and @benscott figure out the problem.