relaton / relaton-nist

NistBib: retrieve NIST Standards for bibliographic use using the BibliographicItem model
https://www.metanorma.com
MIT License

Data source problem: the new dataset has duplications and misses some documents #113

Open andrew2net opened 1 month ago

andrew2net commented 1 month ago

The new dataset contains 417 duplicates. In every case, the duplicated documents have identical URLs. The URLs of the duplicated docs are listed here: log.txt

Originally posted by @andrew2net in https://github.com/relaton/relaton-nist/issues/112#issuecomment-2211570678
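For anyone who wants to reproduce a check like this, here is a minimal sketch of grouping records by URL to find duplicates. The record shape (`id` and `url` keys) and the sample values are hypothetical; this is not the code used to produce log.txt.

```python
from collections import defaultdict


def find_duplicate_urls(records):
    """Group record IDs by URL and keep only URLs that occur more than once."""
    by_url = defaultdict(list)
    for rec in records:
        by_url[rec["url"]].append(rec["id"])
    return {url: ids for url, ids in by_url.items() if len(ids) > 1}


# Hypothetical sample records; real entries would come from the parsed dataset.
records = [
    {"id": "A", "url": "https://example.org/doc1.pdf"},
    {"id": "B", "url": "https://example.org/doc2.pdf"},
    {"id": "C", "url": "https://example.org/doc1.pdf"},
]

duplicates = find_duplicate_urls(records)
for url, ids in duplicates.items():
    print(url, ids)
```

Keeping the IDs next to each URL makes it easy to see whether the duplicated entries are truly identical or differ in other fields.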

andrew2net commented 1 month ago

Also, the MODS dataset has 193 fewer docs than allrecords.xml. Because some docs exist only in the MODS dataset, more than 193 docs are missing from it. The list of differences is here: diff.txt
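The reasoning above (a count gap of 193 understates the number of missing docs) can be illustrated with set differences. This is only a sketch with made-up IDs; the function name and sample data are assumptions, not part of relaton-nist.

```python
def compare_datasets(old_ids, new_ids):
    """Report IDs unique to each dataset and the raw difference in size."""
    old_set, new_set = set(old_ids), set(new_ids)
    return {
        "only_in_old": sorted(old_set - new_set),
        "only_in_new": sorted(new_set - old_set),
        "count_gap": len(old_set) - len(new_set),
    }


# Made-up IDs: "old" stands in for allrecords.xml, "new" for the MODS dataset.
old = ["doc-1", "doc-2", "doc-3", "doc-4"]
new = ["doc-1", "doc-2", "doc-5"]

result = compare_datasets(old, new)
print(result)
```

Here the count gap is 1, but 2 docs are missing from the new dataset, because 1 doc exists only in the new one. In general, the number of missing docs equals the count gap plus the number of docs that exist only in the new dataset.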

ronaldtse commented 1 month ago

@andrew2net in the diff file I see a number of "IPD" entries.

I believe only the NIST CSRC source provides IPDs, other draft entries, and the CSWPs. These were never provided by the NIST-Tech-Pubs repository. Can you update the diff file? Thanks.

andrew2net commented 1 month ago

@ronaldtse do you mean entries such as NIST IR 8320C ipd? Such documents were present in allrecords.xml:

   ...
   <query key="IR">
      <doi type="report-paper_title">10.6028/NIST.IR.8320C.ipd</doi>
      ...
   <query key="IR">
      <doi type="report-paper_title">10.6028/NIST.IR.8286D.ipd</doi>
      ...
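A quick way to list such entries is to scan the XML for DOIs ending in `.ipd`. The sketch below assumes a simplified structure: the `<records>` root, the closing `</query>` tags, and the third (placeholder) DOI are assumptions, since the real nesting in allrecords.xml is elided above and may involve namespaces.

```python
import xml.etree.ElementTree as ET

# Simplified sample; only the two .ipd DOIs are taken from the excerpt above.
SAMPLE = """<records>
  <query key="IR">
    <doi type="report-paper_title">10.6028/NIST.IR.8320C.ipd</doi>
  </query>
  <query key="IR">
    <doi type="report-paper_title">10.6028/NIST.IR.8286D.ipd</doi>
  </query>
  <query key="IR">
    <doi type="report-paper_title">10.6028/EXAMPLE.PLACEHOLDER</doi>
  </query>
</records>"""


def find_draft_dois(xml_text, suffix=".ipd"):
    """Return the text of every <doi> element whose value ends with suffix."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter("doi")  # iter() finds <doi> at any depth
        if el.text and el.text.strip().endswith(suffix)
    ]


draft_dois = find_draft_dois(SAMPLE)
print(draft_dois)
```
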
ronaldtse commented 1 month ago

Very interesting. I will report this to NIST.

ronaldtse commented 1 month ago

@andrew2net wrote: "The new dataset contains 417 duplicates. In every case, the duplicated documents have identical URLs. The URLs of the duplicated docs are listed here: log.txt"

We now have a command that diffs the duplicated docs to see what happened:

I have posted the detailed diff to NIST here:

ronaldtse commented 1 month ago

@andrew2net the "missing documents list" is inaccurate.

< NBS_FIPS_11-1-SEP30_1977.yaml
---
> NBS_FIPS_11-1-SEP30.yaml
< NBS_FIPS_89-SEP1.yaml
---
> NBS_FIPS_89-SEP1981.yaml
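Pairs like these, where one file name is a truncated form of the other, can be separated from genuinely missing documents with a simple prefix heuristic. This is a sketch only: the function name is hypothetical, and a prefix match can produce false pairs, so the results still need manual review.

```python
from pathlib import PurePath


def pair_renamed(only_left, only_right):
    """Pair file names from the two sides of a diff when one stem is a prefix of the other."""
    pairs = []
    unmatched_right = set(only_right)
    for left in only_left:
        left_stem = PurePath(left).stem  # drop the .yaml extension
        for right in sorted(unmatched_right):
            right_stem = PurePath(right).stem
            if left_stem.startswith(right_stem) or right_stem.startswith(left_stem):
                pairs.append((left, right))
                unmatched_right.discard(right)
                break
    return pairs


# File names taken from the diff output above.
left_side = ["NBS_FIPS_11-1-SEP30_1977.yaml", "NBS_FIPS_89-SEP1.yaml"]
right_side = ["NBS_FIPS_11-1-SEP30.yaml", "NBS_FIPS_89-SEP1981.yaml"]

pairs = pair_renamed(left_side, right_side)
for left, right in pairs:
    print(f"{left} <-> {right}")
```

Entries that end up in a pair are likely the same document under a different ID and should not be counted as missing.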

These IDs are wrong. I have reported them:

I have investigated these IDs and reported many issues to NIST: