rdmpage / biostor

Open access articles extracted from the Biodiversity Heritage Library

Batch check for "Verhoeff 1898" #97

Open Archilegt opened 2 years ago

Archilegt commented 2 years ago

A batch check on articles authored by Carl [Karl Wilhelm] Verhoeff in year 1898 is in progress. I may add one comment per article as I go, rather than just one long comment for all articles. I will indicate with each comment whether there are no issues or if issues have been detected and action can be taken for each individual article. Completion of this batch check may take a few hours. Please, be patient.

Archilegt commented 2 years ago

I finally decided to post a "batch part" for Archiv für Naturgeschichte 64, Band 1

Verhoeff, Carl (1898): Ueber Diplopoden aus Bosnien, Herzogowina und Dalmatien. IV. Theil: Julidae. Archiv für Naturgeschichte, 64:1 (1): 119-160 + pls. V-VI. https://www.biodiversitylibrary.org/part/6891 Remarks:

  1. This article has a DOI (https://doi.org/10.5962/bhl.part.6891).
  2. Page range is correct, but the PDF is not formed with plates V and VI (found at https://www.biodiversitylibrary.org/page/14203474 and https://www.biodiversitylibrary.org/page/14203476 respectively). This is particularly remarkable for this article, as during the discussions on retro-DOIs at TDWG 2021, it was assured to me that BHL DOIs were carefully assigned after ensuring that the formed PDF would also include the plates elsewhere in a given volume. More recently (1.viii.2022), Kearney & Page (https://doi.org/10.3897/biss.6.91104) stated that “These articles now have article landing pages and (as of March 2022) pre-generated PDFs (Richard 2022), bringing them in line with modern publishing standards.” That is not the case for this article and others that I analyze in this series. In my opinion, until the pre-generated PDFs include the missing plates and missing bibliographic metadata (see point 3 below), they are not “in line with modern publishing standards”.
  3. The “number” in Biostor and “issue” in BHL are missing. The number/issue should be given as “1”, but that is not representing the “1” from the Band, but the “1” from Heft 1 within Band 1. I am managing this issue in Myriatrix by recording “64:1” as the volume and the numbers of each Heft as the issue. Ultimately, the Hefte numbers are the ones bearing importance on publication dates and should definitely be captured to make bibliographic references more accurate.
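A minimal sketch of how the Band/Heft distinction in point 3 could be kept apart when parsing references written in the citation style used in this thread ("64:1 (1): 119-160"). The `Locator` type and `parse_locator` function are hypothetical names for illustration only; they are not part of BioStor, BHL, or Myriatrix.

```python
import re
from typing import NamedTuple, Optional


class Locator(NamedTuple):
    volume: str            # e.g. "64"
    band: Optional[str]    # e.g. "1", the Band within the volume, if given
    issue: Optional[str]   # e.g. "1", the Heft (issue) within the Band, if given
    first_page: int
    last_page: int


# Matches "64:1 (1): 119-160", "21 (549): 32-39" and "48: 292-305".
_LOCATOR = re.compile(
    r"(?P<volume>\d+)(?::(?P<band>\d+))?\s*"
    r"(?:\((?P<issue>\d+)\))?\s*:\s*"
    r"(?P<first>\d+)\s*-\s*(?P<last>\d+)"
)


def parse_locator(citation: str) -> Locator:
    """Extract volume, Band, Heft and page range from a citation string."""
    m = _LOCATOR.search(citation)
    if m is None:
        raise ValueError(f"no volume/page locator found in: {citation!r}")
    return Locator(m["volume"], m["band"], m["issue"],
                   int(m["first"]), int(m["last"]))
```

For example, `parse_locator("Archiv für Naturgeschichte, 64:1 (1): 119-160 + pls. V-VI")` keeps the Band ("1") and the Heft ("1") in separate fields, so neither can be mistaken for the other.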

Verhoeff, Carl (1898): Ueber Diplopoden aus Bosnien, Herzogowina und Dalmatien. V. Theil: Glomeridae und Polyzoniidae (Schluss). Archiv für Naturgeschichte, 64:1 (2): 161-176 + pl. VII. https://www.biodiversitylibrary.org/part/226022 Remarks:

  1. This article does not have a DOI.
  2. Page range is correct, but the PDF is not formed with plate VII (found at https://www.biodiversitylibrary.org/page/14203478).
  3. The “number” in Biostor and “issue” in BHL is given as “1”, but that is the Band, in this case better matching a part of a volume. The correct numbers/issues are those borne by each Heft within Band 1. This article was published in Heft 2.

Verhoeff, Carl (1898): Kritisches, systematisch-historisch-litterarisches Verzeichniss der bis Ende 1897 beschriebenen Diplopoden von Oesterreich-Ungarn und dem Occupationsgebiet. Archiv für Naturgeschichte, 64:1 (3): 317-334. https://www.biodiversitylibrary.org/part/226025 Remarks:

  1. This article does not have a DOI.
  2. Page range is correct. The article has no plates. PDF formation is correct.
  3. As in point 3 of the previous reference. This article was published in Heft 3.

Verhoeff, Carl (1898): Beiträge zur Kenntniss paläarktischer Myriopoden. VI. Aufsatz: Ueber paläarktische Geophiliden. Archiv für Naturgeschichte, 64:1 (3): 335-362 + pl. VIII. https://www.biodiversitylibrary.org/part/226026 Remarks:

  1. This article does not have a DOI.
  2. Page range is correct, but the PDF is not formed with plate VIII (found at https://www.biodiversitylibrary.org/page/14203480).
  3. As in point 3 of the previous reference. This article was published in Heft 3.

Verhoeff, Carl (1898): Beiträge zur Kenntniss paläarktischer Myriopoden. VII. Aufsatz: Ueber neue und wenig bekannte Polydesmiden aus Siebenbürgen, Rumänien und dem Banat. Archiv für Naturgeschichte, 64:1 (3): 363-372 + pl. IX. https://www.biodiversitylibrary.org/part/226027 Remarks:

  1. This article does not have a DOI.
  2. Page range is 363-372, incorrect in Biostor and BHL as 363-373, which leads to the incorrect formation of a PDF with an extra page. The PDF is not formed with plate IX (found at https://www.biodiversitylibrary.org/page/14203482).
  3. As in point 3 of the previous reference. This article was published in Heft 3.

This "batch part" ends here. To be continued with articles from other journals.

Archilegt commented 2 years ago

Batch part for Zoologischer Anzeiger 21

Verhoeff, Carl (1898): Noch einige Worte über Segmentanhänge bei Insecten und Myriopoden. Zoologischer Anzeiger, 21 (549): 32-39. https://www.biodiversitylibrary.org/page/9739005 Publication date: 10/01/1898 Remarks:

  1. This article has no article-level metadata, landing page, pre-generated PDF or DOI.

Verhoeff, Carl (1898): Einige Worte über europäische Höhlenfauna. Zoologischer Anzeiger, 21 (552): 136-140. https://www.biodiversitylibrary.org/page/9739109 Publication date: 14/02/1898 Remarks:

  1. This article has no article-level metadata, landing page, pre-generated PDF or DOI.
  2. The starting page of this article is also the last page of article “2. Zur Anatomie der Dendrochiroten, nebst Beschreibungen neuer Arten”, which has partial article-level metadata, landing page, and a pre-generated PDF (see https://www.biodiversitylibrary.org/part/28248). The issue/number of article 2 is “552” but it is missing. The publication date is given as “1898” but it can be refined to “14/02/1898”.

Verhoeff, Carl (1898): Bemerkungen zur neuesten „Contribuzione alla conoscenza dei Diplopodi” des Dr. F. Silvestri. Zoologischer Anzeiger, 21 (555): 223-226. https://www.biodiversitylibrary.org/page/9739196 Publication date: 21/03/1898 Remarks:

  1. This article has no article-level metadata, landing page, pre-generated PDF or DOI.
  2. The starting page of this article is also the last page of article “1. Über einige neue Reptilien und einen neuen Frosch aus dem cilicischen Taurus”, which has partial article-level metadata, landing page, and a pre-generated PDF (see https://www.biodiversitylibrary.org/part/28249). The issue/number of article 1 is “555” but it is missing. The publication date is given as “1898” but it can be refined to “21/03/1898”.

This "batch part" ends here. To be continued with articles from other journals.

rdmpage commented 2 years ago

Hi @Archilegt, many thanks for the very useful feedback. Just to be clear, there are IMHO two separate issues here. The first is the assignment of BHL DOIs, the second is the quality of the metadata.


> This is particularly remarkable for this article, as during the discussions on retro-DOIs at TDWG 2021, it was assured to me that BHL DOIs were carefully assigned after ensuring that the formed PDF would also include the plates elsewhere in a given volume.

BHL has a rather long and tortuous relationship with DOIs. The goal @nicolekearney and I articulated applies to newly minted DOIs (of the form p.nnnnn). Some time ago BHL minted a set of DOIs for articles of the form bhl.part.nnnn. These articles weren't checked for metadata quality, not all articles in those journals were identified, and not all identified articles were assigned DOIs. Neither @nicolekearney nor I was involved in that initial batch.

Since 2020 we've been working to identify (as best as possible) all the articles in a journal, have those articles checked by volunteers, consult with existing publishers (if the journal already has DOIs) and CrossRef, then mint DOIs for (ideally) all articles in a journal that BHL has access to.

I would hope that newly minted DOIs (p.nnnnn) will meet your expectations of a modern publisher (within the constraint that the majority of BHL content is not born digital).


The examples you give of articles lacking plates or having incomplete metadata are well-known problems. Most of the articles in BHL have been found using my semi-automated BioStor tools, which depend on the quality of metadata from various sources. If the source metadata lacks some details, so will BioStor. Given the scale of the task - identifying hundreds of thousands of articles in millions of pages - I rely on automation to make some sort of headway.

The issue of missing plates is always frustrating, and typically is only resolved by manual inspection and correction.

For the articles in Archiv für Naturgeschichte 64, Band 1, I will add the missing plates that you've discovered. Regarding how to represent Band and Hefte, there seem to be multiple ways to do this. I note that ZOBODAT has:

Karl Wilhelm [Carl] Verhoeff (1898): Ueber Diplopoden aus Bosnien, Herzegowina und Dalmatien. IV. Theil: Julidae. Archiv für Naturgeschichte 64-1: 119 - 160.

I'll leave it as is in BioStor.

rdmpage commented 2 years ago

@Archilegt Regarding Zoologischer Anzeiger 21, the articles you mention have not been identified in BHL (yet). The challenge is always whether there is good quality metadata available, and finding the time to process that metadata and add it to BioStor (and hence to BHL). In the case of Zoologischer Anzeiger, ZOBODAT seems an obvious source: https://www.zobodat.at/publikation_series.php?id=20912

Archilegt commented 2 years ago

This batch part contains a single article.

Verhoeff, Carl (1898): Ueber Diplopoden aus Kleinasien. Verhandlungen der kaiserlich-königlichen zoologisch-botanischen Gesellschaft in Wien, 48: 292-305 + pls. IV-V. https://www.biodiversitylibrary.org/part/39235

Reception date: 25/03/1898 Remarks:

  1. This article does not have a DOI.
  2. Page range is correct in article-level metadata, but the PDF is formed until page 304 and the plate IV immediately after, while leaving out four pages: 1) the blank page corresponding to the back of plate IV, 2) plate V, 3) the blank page corresponding to the back of plate V, and 4) page 305, the last page of the article. This seems to be due to PDF formation being brute-forced through the article’s metadata page interval, combined with the fact that page 304 is the last of signature 39, then coming the four pages of plates and page 305, the first of signature 40.
  3. The Verhandlungen volume 48 is scanned two times in BHL: once by the MBLWHOI Library and once by the University of Illinois Urbana-Champaign. However, only the scanned version by the MBLWHOI Library has parts. This reminds me of what I said during TDWG 2021 about same book pages in BHL needing a common resolver for deduplication and for pointing to all versions of the same page. In the case of “identified parts”, once a part is identified in one scanned version, it should be consistently applied to all scanned versions of the same book, and one resolver should be created for pointing to all of them.
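One possible reading of the behaviour described in point 2, sketched purely as an illustration: if a PDF builder takes (last page - first page + 1) consecutive scans starting at the first page, the result stops at plate IV, exactly as observed. This is an assumption about the mechanism, not BHL's or BioStor's actual code; the function names and the page labels in `scan_order` are hypothetical.

```python
# Scan order as described in point 2: text pages 292-304, then plate IV,
# its blank verso, plate V, its blank verso, and finally page 305.
scan_order = [str(n) for n in range(292, 305)] + [
    "pl. IV", "blank", "pl. V", "blank", "305",
]


def by_page_count(pages, first, last):
    """Take (last - first + 1) consecutive scans starting at `first`."""
    start = pages.index(str(first))
    return pages[start:start + (last - first + 1)]


def by_endpoints(pages, first, last):
    """Take every scan from `first` through `last`, whatever lies between."""
    start = pages.index(str(first))
    end = pages.index(str(last))
    return pages[start:end + 1]


truncated = by_page_count(scan_order, 292, 305)  # ends at "pl. IV"
complete = by_endpoints(scan_order, 292, 305)    # ends at "305"
```

Selecting by the positions of the first and last page in the scan sequence, rather than by the arithmetic page count, picks up the unnumbered plates and blank versos that sit between pages 304 and 305.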

This "batch part" ends here. To be continued with one more article from another journal.

rdmpage commented 2 years ago

To fix:

OK, I think we fixed the missing plate problem for these articles. The changes will take a day or two to filter through to the BHL site.

Archilegt commented 2 years ago

This batch part contains a single article.

Verhoeff, Karl (1898): Fauna diplopoda Bosne, Hercegovine i Dalmacije. Glasnik Zemaljskog muzeja u Bosni i Hercegovini, 10 (2): 467-491. http://www.bosniafacts.info/downloads/elibrary/category/4-glasnik-zemaljskog-muzeja-bosne-i-hercegovine-1889-2009?download=18:glasnik-zemaljskog-muzeja-bosne-i-hercegovine-1898-prvi-dio


  1. Journal not found in BHL. The German version is in BHL under the title “Wissenschaftliche Mitteilungen aus Bosnien und Herzegovina” (see https://www.biodiversitylibrary.org/bibliography/110065). The German version of the article, published in 1899 with at least one error, can be found at https://www.biodiversitylibrary.org/page/49030754
  2. There are at least two repositories from which information on the Bosnian version of the journal and articles can be retrieved: From Bosnia Facts (http://www.bosniafacts.info/downloads/elibrary/category/4-glasnik-zemaljskog-muzeja-bosne-i-hercegovine-1889-2009) and from INFOBIRO (see https://www.zemaljskimuzej.ba/bs/glasnik-zemaljskog-muzeja-bih and http://www.infobiro.ba/results/1?contentTypeCode=77&contentSubTypeCode=1391&contentSourceCode=1392&sortby=datum&sort_order=asc).
  3. Article-level metadata: The work appeared in the issue for April-September 1898, the second issue of that year. Issue number can be given as “2” and the publication date tentatively as “09/1898”. The scans available to me end on page 491 despite the journal index reference citing page 492. It probably was a blank page. The following article starts on page 493. Page interval recorded in Myriatrix as “467-491”.

This is the last "batch part" for "Verhoeff 1898".

Archilegt commented 2 years ago

Hi, @rdmpage! Many thanks for your feedback and insights, and for taking care of this issue. I didn't reply directly until now because I really had to focus and push through this curation challenge. It's very late for me now but I wasn't going to bed without thanking you. Have a good night!

Archilegt commented 2 years ago

Follow up: Regeneration of PDFs with plates successful for: 226022, 226026, 226027. Page interval of 226027 is now correct.

Regeneration of PDFs unsuccessful for: 6891 and 39235. Verhoeff, Carl (1898): Ueber Diplopoden aus Bosnien, Herzogowina und Dalmatien. IV. Theil: Julidae. Archiv für Naturgeschichte, 64:1 (1): 119-160 + pls. V-VI. https://www.biodiversitylibrary.org/part/6891 Remark: This is the only article with DOI. Maybe that is interfering.

Verhoeff, Carl (1898): Ueber Diplopoden aus Kleinasien. Verhandlungen der kaiserlich-königlichen zoologisch-botanischen Gesellschaft in Wien, 48: 292-305 + pls. IV-V. https://www.biodiversitylibrary.org/part/39235 Remark: This was the complex case of adding one more page and one more plate (and maybe related blank pages).

@rdmpage, could you please check this out?

rdmpage commented 2 years ago

@Archilegt Ah, I think this was my mistake 🤦‍♂️ I made the changes locally, but didn't push them to https://biostor.org, which means that BHL didn't get them. I've fixed this, so hopefully in a day or two you should be able to get new PDFs.

Archilegt commented 2 years ago

Many thanks, @rdmpage! I will follow up this issue and close it when I see the changes in BHL.

Archilegt commented 2 years ago

Follow up: Regeneration of PDF with plates successful for 39235.

Regeneration of PDF unsuccessful for 6891. Verhoeff, Carl (1898): Ueber Diplopoden aus Bosnien, Herzogowina und Dalmatien. IV. Theil: Julidae. Archiv für Naturgeschichte, 64:1 (1): 119-160 + pls. V-VI. https://www.biodiversitylibrary.org/part/6891

The PDF is not formed with plates. Plate V is found at https://www.biodiversitylibrary.org/page/14203474 Plate VI is found at https://www.biodiversitylibrary.org/page/14203476

rdmpage commented 2 years ago

Mea culpa, I'd updated "Ueber Diplopoden aus Bosnien..." locally but not passed those plates on to BHL. They should have a new PDF in the next day or so.

Archilegt commented 2 years ago

@mlichtenberg, could you please trigger the update needed above?

mlichtenberg commented 2 years ago

Hmmm, there are two issues with part 6891 (that is what is being referred to, correct?).

First, @rdmpage I don't see the item (49922) that segment appears in being sent to BHL in the last couple days. It was last sent on August 24. Second, that segment has a BHL-assigned DOI, so it will not accept updates from BioStor. BHL staff will need to update it by hand. Once that is done, the PDF should regenerate.

rdmpage commented 2 years ago

@mlichtenberg The current version in BioStor has the plates (http://biostor.org/reference/61689). I think because they hadn’t appeared in BHL I assumed I’d failed to update it, whereas I had on August 24th.

So it looks like the issue is the block due to the BHL-assigned DOI. Can I assume that BHL will add these extra plates?

mlichtenberg commented 2 years ago

@rdmpage I added the plates to the segment, and the PDF has now been regenerated.

rdmpage commented 2 years ago

Thanks Mike, that’s great!

Archilegt commented 2 years ago

Wonderful! Many thanks @mlichtenberg and @rdmpage! I uploaded all the PDFs to the respective bibliographic references in Myriatrix.

Some summary statistics: For author=1, year=1, publications=10, one publication is not in BHL, three have no individual references (all from one journal), six have references. Of the six with references, five needed curation. That gives a tiny idea of how much still needs to be done to have a complete "bibliography of life" at the author level. I will try expanding this with batch checks for Verhoeff for publication years related to the correspondence that I am processing.

@rdmpage, it may be that you would like to write some of what we did and learned here in the "Verhoeff paper" I am working on with other colleagues. The essence of it is: Martínez-Muñoz CA, Huff D, Meister M, Driller C (2022) Mobilizing and Enhancing Legacy Biodiversity Data: The case of Karl Wilhelm Verhoeff's correspondence. Biodiversity Information Science and Standards 6: e93679. https://doi.org/10.3897/biss.6.93679 but it is more than that and there is definitely space to add something about BioStor. Please, send me an email to my "archilegt" Gmail if you are interested.

Archilegt commented 2 years ago

Reopening to document bibliographic inconsistency in BHL for Archiv für Naturgeschichte 64, Band 1

Currently:
Volume 64, Pages 119--160 https://www.biodiversitylibrary.org/part/6891
Volume 64, Series / Issue Issue: 1, Pages 161--176 https://www.biodiversitylibrary.org/part/226022
Volume 64, Series / Issue Issue: 1, Pages 317--334 https://www.biodiversitylibrary.org/part/226025
Volume 64, Series / Issue Issue: 1, Pages 335--362 https://www.biodiversitylibrary.org/part/226026
Volume 64, Series / Issue Issue: 1, Pages 363--373 https://www.biodiversitylibrary.org/part/226027

Observations: The first reference is better in that while missing the "volume + Band" value "64-1" (as in ZOBODAT) or "64:1" (as in Myriatrix), at least it does not introduce incorrect values as "Issue". In the following four references the issue is incorrect. Also note how the page interval of BHL part 226027 is still incorrect.

The corrected metadata should be:
Volume 64-1, Series / Issue Issue: 1, Pages 119--160 https://www.biodiversitylibrary.org/part/6891
Volume 64-1, Series / Issue Issue: 2, Pages 161--176 https://www.biodiversitylibrary.org/part/226022
Volume 64-1, Series / Issue Issue: 3, Pages 317--334 https://www.biodiversitylibrary.org/part/226025
Volume 64-1, Series / Issue Issue: 3, Pages 335--362 https://www.biodiversitylibrary.org/part/226026
Volume 64-1, Series / Issue Issue: 3, Pages 363--372 https://www.biodiversitylibrary.org/part/226027

If value "64-1" is not permitted, then give "64" and correct the issue numbers as above. Correct page interval of BHL part 226027 is 363--372.
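To make the requested changes explicit per field, the two listings above can be compared mechanically. This is only a sketch: the values are transcribed from this comment, while the dictionary layout, the `pending_changes` function, and the `allow_band_in_volume` flag (modelling the fallback to plain "64") are hypothetical and do not reflect how BHL stores its metadata.

```python
# Keys are BHL part identifiers; values transcribed from the listings above.
CURRENT = {
    6891:   {"volume": "64", "issue": None, "pages": (119, 160)},
    226022: {"volume": "64", "issue": "1",  "pages": (161, 176)},
    226025: {"volume": "64", "issue": "1",  "pages": (317, 334)},
    226026: {"volume": "64", "issue": "1",  "pages": (335, 362)},
    226027: {"volume": "64", "issue": "1",  "pages": (363, 373)},
}
CORRECTED = {
    6891:   {"volume": "64-1", "issue": "1", "pages": (119, 160)},
    226022: {"volume": "64-1", "issue": "2", "pages": (161, 176)},
    226025: {"volume": "64-1", "issue": "3", "pages": (317, 334)},
    226026: {"volume": "64-1", "issue": "3", "pages": (335, 362)},
    226027: {"volume": "64-1", "issue": "3", "pages": (363, 372)},
}


def pending_changes(current, corrected, allow_band_in_volume=True):
    """Return {part_id: {field: (old, new)}} for every field that differs."""
    changes = {}
    for part_id, wanted in corrected.items():
        wanted = dict(wanted)
        if not allow_band_in_volume:
            # Fallback: keep the plain volume "64" if "64-1" is not permitted.
            wanted["volume"] = wanted["volume"].split("-")[0]
        have = current[part_id]
        diff = {f: (have[f], wanted[f]) for f in wanted if have[f] != wanted[f]}
        if diff:
            changes[part_id] = diff
    return changes
```

With `allow_band_in_volume=False`, only the issue numbers and the page interval of part 226027 remain to be changed.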

@mlichtenberg, is this something that you could manually update? Or what would it require?

Archilegt commented 2 years ago

Further details for me to investigate: The following PDFs are complete in ZOBODAT, including plates, the source is given as BHL, but there is no indication of whether they were harvested from BHL after the improvements documented here:

https://www.zobodat.at/pdf/Archiv-Naturgeschichte_64-1_0119-0160.pdf
https://www.zobodat.at/pdf/Archiv-Naturgeschichte_64-1_0161-0176.pdf
https://www.zobodat.at/pdf/Archiv-Naturgeschichte_64-1_0317-0334.pdf [N/A, PDF was always complete]
https://www.zobodat.at/pdf/Archiv-Naturgeschichte_64-1_0335-0362.pdf

It is important to document the source and timing of the complete PDFs, to know how fixing something in BioStor improves not just BHL but also other databases like ZOBODAT. It would be important that if/when ZOBODAT harvests new PDFs, the BHL credit page is kept, among other things to allow knowing the PDF generation date. Currently the PDFs in ZOBODAT do credit BHL with a "watermark" on each page but no BHL credit page is present at the end of the PDFs, just the ZOBODAT credit page.

The following ZOBODAT reference has an incorrect page interval and no PDF associated: https://www.zobodat.at/publikation_articles.php?id=231334 Investigate if fixing the page interval at BHL and ZOBODAT would trigger the addition of a PDF from BHL part 226027 to the yet non-existent ZOBODAT URL: https://www.zobodat.at/pdf/Archiv-Naturgeschichte_64-1_0363-0372.pdf

The following PDF is complete in ZOBODAT, including plates, and the source is not BHL but ZOBODAT itself: https://www.zobodat.at/pdf/VZBG_48_0292-0305.pdf Investigate whether triggering an update replaces ZOBODAT PDFs with BHL PDFs.
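The ZOBODAT PDF URLs listed in this comment appear to follow a regular pattern: a series label, the volume, and the page interval zero-padded to four digits. A small sketch for predicting the URL to check once a page interval is fixed follows. The pattern is inferred only from the URLs quoted here and may not hold for other series; `zobodat_pdf_url` is a hypothetical helper, not a ZOBODAT API.

```python
def zobodat_pdf_url(series: str, volume: str, first_page: int, last_page: int) -> str:
    """Build the PDF URL that the pattern seen in this thread would predict."""
    return (
        "https://www.zobodat.at/pdf/"
        f"{series}_{volume}_{first_page:04d}-{last_page:04d}.pdf"
    )


# The not-yet-existing URL for BHL part 226027 with the corrected interval:
expected = zobodat_pdf_url("Archiv-Naturgeschichte", "64-1", 363, 372)
```

The same function reproduces the existing Verhandlungen URL when called with `("VZBG", "48", 292, 305)`.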

mlichtenberg commented 2 years ago

@Archilegt I've submitted a ticket for the requested metadata updates to the BHL issue tracker.

This page (https://about.biodiversitylibrary.org/ufaqs/ive-noticed-a-problem-with-the-bhl-collection-or-website-what-can-i-do/) has a link to BHL's feedback form, which is where such requests can be submitted.

Archilegt commented 1 year ago

Many thanks, @mlichtenberg. About the contact channel you suggested: The last time I tried using the form at https://www.biodiversitylibrary.org/contact#/comments, it wasn't working and it was not possible to know until trying to submit. I was trying to submit a bibliographic issue, e.g., request titles to scan. I then wrote to @udcmrk but I did not receive a reply. I will try the form again in the future, combined with documentation here in BioStor and other repositories I work on. If it doesn't work, I will try the feedback@biodiversitylibrary.org email in the link you provided. Thanks again!