Open andrew2net opened 1 month ago
Also, the MODS dataset has 193 fewer docs than allrecords.xml. In addition, some docs exist only in the MODS dataset, so more than 193 are missing from it. Here is the list of differences: diff.txt
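To illustrate why the number of missing docs can exceed the net count gap, here is a minimal Python sketch. The IDs are made-up toy values and `compare_ids` is a hypothetical helper, not part of relaton-nist.

```python
def compare_ids(allrecords_ids, mods_ids):
    """Return (only_in_allrecords, only_in_mods) as sorted lists."""
    a, b = set(allrecords_ids), set(mods_ids)
    return sorted(a - b), sorted(b - a)


# Toy, made-up IDs for illustration only (not real NIST identifiers).
allrecords = ["A", "B", "C", "D", "E"]
mods = ["A", "B", "X"]

missing_from_mods, extra_in_mods = compare_ids(allrecords, mods)

# The net gap only compares the dataset sizes.
net_gap = len(set(allrecords)) - len(set(mods))

# Docs missing from MODS = net gap + docs that exist only in MODS.
print(net_gap, missing_from_mods, extra_in_mods)
```

In this toy case the net gap is 2, but 3 docs are missing from the second set, because one doc exists only there.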
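In the diff, lines prefixed with `<` come from the first file and lines prefixed with `>` come from the second. Line numbers are given as `start,end` for ranges of lines, or just a single number for single lines. The line numbers for the first file and the second file are separated by `c` (for changed), `a` (for added), or `d` (for deleted).
@andrew2net in the diff file I see a number of "IPD" entries.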
I believe only the NIST CSRC source provides IPDs and other draft entries and the CSWPs. These were never given by the NIST-Tech-Pubs repository. Can you update the diff file? Thanks.
@ronaldtse do you mean `NIST IR 8320C ipd`? There were such documents in allrecords.xml:
...
<query key="IR">
<doi type="report-paper_title">10.6028/NIST.IR.8320C.ipd</doi>
...
<query key="IR">
<doi type="report-paper_title">10.6028/NIST.IR.8286D.ipd</doi>
...
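A rough way to list such entries is sketched below in Python. It assumes the `doi` elements carry no XML namespace and that IPD DOIs end in `.ipd`, as in the excerpt above. The `<records>` wrapper, the third sample DOI, and the `find_ipd_dois` helper are hypothetical and only there to make the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <records> root and the last DOI are placeholders;
# the first two DOIs are the ones quoted from allrecords.xml above.
SAMPLE = """<records>
  <query key="IR"><doi type="report-paper_title">10.6028/NIST.IR.8320C.ipd</doi></query>
  <query key="IR"><doi type="report-paper_title">10.6028/NIST.IR.8286D.ipd</doi></query>
  <query key="IR"><doi type="report-paper_title">10.6028/NIST.IR.0000</doi></query>
</records>"""


def find_ipd_dois(xml_text):
    """Return the text of every <doi> element whose value ends with '.ipd'."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter("doi")  # assumes un-namespaced <doi> tags
        if el.text and el.text.strip().lower().endswith(".ipd")
    ]


print(find_ipd_dois(SAMPLE))
```

If the real file uses namespaces, the tag name passed to `iter` would need the namespace prefix in `{uri}doi` form.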
Very interesting. I will report this to NIST.
The new dataset has 417 duplications. In each duplication case, the URLs of duplicated documents are identical. Here are the URLs of duplicated docs: log.txt
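For reference, a check like this can be sketched in Python by grouping document IDs by URL. This is only an illustration under assumed inputs: the `(doc_id, url)` pairs, the example.org URLs, and the `group_duplicates` helper are hypothetical, and it is not the actual relaton-nist logic.

```python
from collections import defaultdict


def group_duplicates(docs):
    """Map each URL shared by more than one doc to the list of doc IDs using it."""
    by_url = defaultdict(list)
    for doc_id, url in docs:
        by_url[url].append(doc_id)
    return {url: ids for url, ids in by_url.items() if len(ids) > 1}


# Made-up (doc_id, url) pairs for illustration only.
docs = [
    ("DOC 1", "https://example.org/a.pdf"),
    ("DOC 2", "https://example.org/b.pdf"),
    ("DOC 1 (copy)", "https://example.org/a.pdf"),
]

duplicates = group_duplicates(docs)
for url, ids in duplicates.items():
    print(url, ids)
```

Only URLs that occur more than once are reported, together with every doc ID that points at them.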
We now have a command that diffs the duplicated docs to see what happened:
I have posted the detailed diff to NIST here:
@andrew2net the "missing documents list" is inaccurate.
< NBS_FIPS_11-1-SEP30_1977.yaml
---
> NBS_FIPS_11-1-SEP30.yaml
< NBS_FIPS_89-SEP1.yaml
---
> NBS_FIPS_89-SEP1981.yaml
These IDs are wrong. I have reported them:
I have investigated these IDs and reported many issues to NIST:
The new dataset has 417 duplications. In each duplication case, the URLs of duplicated documents are identical. Here are the URLs of duplicated docs: log.txt
Originally posted by @andrew2net in https://github.com/relaton/relaton-nist/issues/112#issuecomment-2211570678