relaton / relaton-nist

NistBib: retrieve NIST Standards for bibliographic use using the BibliographicItem model
https://www.metanorma.com
MIT License

Create relaton-data-nist #53

Closed ronaldtse closed 2 years ago

ronaldtse commented 3 years ago

There are two kinds of NIST bibdata:

We should synchronise this information daily into relaton-data-nist for easy citation.

For relaton-nist, if a document is found in the former, use it. Otherwise, search in the latter set.

ronaldtse commented 3 years ago

Related to https://github.com/usnistgov/NIST-Tech-Pubs/issues/1

andrew2net commented 3 years ago

@ronaldtse what should the references for those documents be? For example, the first document has citation-id 78696207 and report-number NBS BH 1. Should we cite it as "NIST 78696207" or as "NIST NBS BH 1"?

<body>
   <query key="BH">
      <doi type="report-paper_title">10.6028/NBS.BH.1</doi>
      <crm-item name="publisher-name" type="string">National Institute of Standards and Technology (NIST)</crm-item>
      <crm-item name="prefix-name" type="string">National Institute of Standards and Technology</crm-item>
      <crm-item name="member-id" type="number">4068</crm-item>
      <crm-item name="citation-id" type="number">78696207</crm-item>
      <crm-item name="book-id" type="number">2050209</crm-item>
      <crm-item name="deposit-timestamp" type="number">201511031134</crm-item>
      <crm-item name="owner-prefix" type="string">10.6028</crm-item>
      <crm-item name="last-update" type="date">2018-03-06T09:55:24Z</crm-item>
      <crm-item name="created" type="date">2015-11-04T17:31:05Z</crm-item>
      <crm-item name="citedby-count" type="number">0</crm-item>
      <doi_record>
         <report-paper>
            <report-paper_metadata language="en">
               <contributors>
                  <person_name sequence="first" contributor_role="author">
                     <given_name>Ira H</given_name>
                     <surname>Woolson</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Edwin H</given_name>
                     <surname>Brown</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>John A</given_name>
                     <surname>Newlin</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>William K</given_name>
                     <surname>Hatt</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Ernest J</given_name>
                     <surname>Russell</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Rudolph P</given_name>
                     <surname>Miller</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Joseph R</given_name>
                     <surname>Worcester</surname>
                  </person_name>
                  <person_name sequence="additional" contributor_role="author">
                     <given_name>Frank P</given_name>
                     <surname>Cartwright</surname>
                  </person_name>
               </contributors>
               <titles>
                  <title>Recommended minimum requirements for small dwelling construction :</title>
                  <subtitle>report of Building Code Committee July 20, 1922</subtitle>
               </titles>
               <edition_number>0</edition_number>
               <publication_date media_type="online">
                  <year>1923</year>
               </publication_date>
               <publisher>
                  <publisher_name>National Bureau of Standards</publisher_name>
                  <publisher_place>Gaithersburg, MD</publisher_place>
               </publisher>
               <institution>
                  <institution_name>National Bureau of Standards</institution_name>
                  <institution_acronym>NBS</institution_acronym>
                  <institution_place>Gaithersburg, MD</institution_place>
               </institution>
               <publisher_item>
                  <item_number item_number_type="report-number">NBS BH 1</item_number>
               </publisher_item>
               <doi_data>
                  <doi>10.6028/NBS.BH.1</doi>
                  <resource>https://nvlpubs.nist.gov/nistpubs/Legacy/BH/nbsbuildinghousing1.pdf</resource>
               </doi_data>
            </report-paper_metadata>
         </report-paper>
      </doi_record>
   </query>
...
ronaldtse commented 3 years ago

@andrew2net the proper citation document identifier is "NBS BH 1" in this case.

NBS is the predecessor of NIST, so:

We can actually take a hint from this:

      <doi type="report-paper_title">10.6028/NBS.BH.1</doi>

The IDs that look like integers are clearly machine-generated and possibly not meant for human citation.
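
A minimal sketch of that rule, assuming the records look like the un-namespaced `<query>` element quoted above (the real file may use XML namespaces, which this does not handle). The function name `citation_identifier` and the fallback to the raw DOI suffix are assumptions made for illustration, not part of relaton-nist.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record quoted earlier in this thread.
SAMPLE = """
<query key="BH">
  <doi type="report-paper_title">10.6028/NBS.BH.1</doi>
  <crm-item name="citation-id" type="number">78696207</crm-item>
  <doi_record>
    <report-paper>
      <report-paper_metadata language="en">
        <publisher_item>
          <item_number item_number_type="report-number">NBS BH 1</item_number>
        </publisher_item>
      </report-paper_metadata>
    </report-paper>
  </doi_record>
</query>
"""


def citation_identifier(query):
    """Return the human-citable identifier for one <query> record.

    Prefers the report-number; falls back to the DOI suffix (an assumption).
    The integer citation-id is deliberately never used.
    """
    item = query.find(".//item_number[@item_number_type='report-number']")
    if item is not None and item.text:
        return item.text.strip()
    doi = query.findtext("doi")
    if doi and "/" in doi:
        # Everything after the registrant prefix, e.g. "NBS.BH.1".
        return doi.split("/", 1)[1]
    return None


print(citation_identifier(ET.fromstring(SAMPLE)))
```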

andrew2net commented 3 years ago

@ronaldtse NBS IR 87-363 contains "error:" in its publisher and institution fields. Maybe NIST should know about it?

               <publisher>
                  <publisher_name>error:</publisher_name>
                  <publisher_place>Gaithersburg, MD</publisher_place>
               </publisher>
               <institution>
                  <institution_name>error:</institution_name>
                  <institution_acronym>error:</institution_acronym>
                  <institution_place>Gaithersburg, MD</institution_place>
               </institution>
ronaldtse commented 3 years ago

Yes! @andrew2net can you file a new issue here?

andrew2net commented 3 years ago

@ronaldtse the source contains relations with DOI-type identifiers. Can we use the DOI as a formattedref?

<related_item>
  <intra_work_relation relationship-type="replaces" identifier-type="doi">10.6028/NIST.SP.1108r3</intra_work_relation>
</related_item>
<related_item>
  <intra_work_relation relationship-type="isVersionOf" identifier-type="doi">10.6028/NIST.SP.1108</intra_work_relation>
</related_item>
ronaldtse commented 3 years ago
  1. We can use the doi ID as input to formattedref.
  2. doi is not the formattedref.

Metanorma already implements the new NIST PubID scheme, which has defined transforms from machine-readable IDs to:

And we need to parse these old DOIs back to PubID.

So we need to extract that code out from metanorma-nist: https://github.com/metanorma/nist-pubid/issues/1

Then we can re-use that in relaton-nist.
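
As a rough illustration of the first half of this (collecting the DOIs that would later be fed into a DOI-to-PubID conversion), here is a small sketch. It assumes the `<related_item>` elements are un-namespaced as quoted above; the `<root>` wrapper and the function name `doi_relations` are hypothetical, and the PubID conversion itself is out of scope.

```python
import xml.etree.ElementTree as ET

# The two relations quoted above, wrapped in a hypothetical <root> element
# so that the snippet is a single well-formed document.
SNIPPET = """
<root>
  <related_item>
    <intra_work_relation relationship-type="replaces" identifier-type="doi">10.6028/NIST.SP.1108r3</intra_work_relation>
  </related_item>
  <related_item>
    <intra_work_relation relationship-type="isVersionOf" identifier-type="doi">10.6028/NIST.SP.1108</intra_work_relation>
  </related_item>
</root>
"""


def doi_relations(element):
    """Return (relationship-type, DOI) pairs for every DOI-typed relation."""
    relations = []
    for rel in element.iter("intra_work_relation"):
        if rel.get("identifier-type") != "doi":
            continue  # only DOI identifiers are of interest here
        relations.append((rel.get("relationship-type"), (rel.text or "").strip()))
    return relations


for kind, doi in doi_relations(ET.fromstring(SNIPPET)):
    print(kind, doi)
```

Each DOI returned here is only the input; the formattedref would be whatever PubID the shared conversion code produces from it.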

andrew2net commented 3 years ago

@ronaldtse there are documents like NBS.BMS.140e2. It looks like it's a second edition but the document contains

<edition_number>0</edition_number>

Should we ignore the edition_number tag if the ID already contains an edition?

ronaldtse commented 2 years ago

@andrew2net https://github.com/usnistgov/NIST-Tech-Pubs/issues/1 has been fixed, can you help update the location of the XML file? Thanks.

ronaldtse commented 2 years ago

Issue https://github.com/relaton/relaton-nist/issues/53#issuecomment-884810725 is posted in #55.

Can we close this ticket?

andrew2net commented 2 years ago

@ronaldtse no, the relaton-data-nist isn't ready. It needs to convert DOI IDs to PubIDs to be able to reference the documents. But the DOI IDs in the source aren't the same as MR IDs. I have many questions about how to map parts of DOI IDs to PubIDs. I'll ask you later. Have a lot of other tasks to finish.

andrew2net commented 2 years ago

Also, we need to move documents from the https://csrc.nist.gov/CSRC/media/feeds/metanorma/pubs-export.zip file to this repo to solve a problem similar to https://github.com/relaton/relaton-calconnect/issues/11

ronaldtse commented 2 years ago

@andrew2net sure, let's merge the bibdata from CSRC into this collection.

andrew2net commented 2 years ago

@ronaldtse the source has some DOI identifiers that need clarification on how they should be mapped to PubID:

  1. NBS.CIRC.15-April1909 - is this docnumber 15 and update-date April 1909?
  2. NBS.CIRC.25insert - what does the insert mean in this reference? How should it be mapped to PubID?
  3. NBS.CIRC.25sup-1924, NBS.CIRC.398sup1937, NBS.CIRC.154suprev, NBS.HB.28supp1949 - What is the sup? Is the supp the same as sup?
  4. NBS.CIRC.488sec1 - How should the sec be mapped to PubID?
  5. NBS.CIRC.54index, NBS.NSRDS.63indx - index and indx?
  6. NBS.CIRC.74errata - errata?
  7. NBS.CRPL.1-2_3-1, NBS.CRPL.1-2_3-1A, NBS.CRPL.4-m-5, NBS.CRPL.c4-4 - Are the 1-2_3-1, 1-2_3-1A, 4-m-5, c4-4 docnumbers or docnumbers with parts?
  8. NBS.FIPS.100-1-1991 - is this part 1 and update-date 1991?
  9. NIST.IR.6867es - es?
  10. NIST.IR.7297c - c?
  11. NIST.IR.8115chi - chi?
  12. NIST.IR.8115viet - viet?
  13. NIST.IR.8178port - port?
  14. NIST.NCSTAR.1-1av1, NCSTAR.1-1cv1, NIST.NCSTAR.1-2bv1 - av, cv, bv?
  15. NIST.SP.1011-I-2.0 - is 1011-I-2.0 a docnumber?
  16. NIST.SP.1075-NCNR - NCNR?
  17. NIST.SP.800-131Ar1 - Ar?
  18. NIST.SP.800-28ver2 - Is ver a version? How should it be mapped to PubID?
  19. NIST.SP.800-38a-add - add?
  20. NIST.SP.800-57pt1r4 - pt?
  21. NIST.SP.801-errata - errata?
  22. NIST.SP.955.Suppl - Suppl?
  23. NIST.AMS.300-8r1/upd, NIST.IR.8115r1-upd - upd?
ronaldtse commented 2 years ago
  1. NBS.CIRC.15-April1909 - is this docnumber 15 and update-date April 1909?

https://nvlpubs.nist.gov/nistpubs/Legacy/circ/nbscircular15-April1909.pdf

This is NBS CIRC ("Circular") No. 15. Yes docnumber=15, series CIRC/Circular, date=1909-04.

  2. NBS.CIRC.25insert - what does the insert mean in this reference? How should it be mapped to PubID?

I think insert means that it's an "included document" inside another document.

In this case, it means this is an "insert" of NBS CIRC 25. The "ins" part can be considered to be in the same category as "supplement". Just as we can have "Supplement 1", we can have "Insert 1".

https://www.govinfo.gov/app/details/GOVPUB-C13-45974defbd2f3d7ab324bcd3506831b7

  3. NBS.CIRC.25sup-1924, NBS.CIRC.398sup1937, NBS.CIRC.154suprev, NBS.HB.28supp1949 - What is the sup? Is the supp the same as sup?

"sup" and "supp" probably mean Supplement. Supplement is a supported type.

  4. NBS.CIRC.488sec1 - How should the sec be mapped to PubID?

"sec" is Section. Treat it like "Part": just as we can have "Part 1" (pt1), we can have "Section 1" (sec1).

  5. NBS.CIRC.54index, NBS.NSRDS.63indx - index and indx?

Both mean "index". Treat it like Supplement and Insert.

  6. NBS.CIRC.74errata - errata?

Errata. Treat it like Supplement and Insert.

  7. NBS.CRPL.1-2_3-1, NBS.CRPL.1-2_3-1A, NBS.CRPL.4-m-5, NBS.CRPL.c4-4 - Are the 1-2_3-1, 1-2_3-1A, 4-m-5, c4-4 docnumbers or docnumbers with parts?

Let's treat them as docnumbers, yes. But did you notice these entries have assigned numbers? Then we don't need to parse the DOIs for them. See this: https://pages.nist.gov/NIST-Tech-Pubs/CRPL.html .

https://nvlpubs.nist.gov/nistpubs/Legacy/crpl/crpl-1-2_3-1.pdf

  8. NBS.FIPS.100-1-1991 - is this part 1 and update-date 1991?

Yes.

  9. NIST.IR.6867es - es?

"es" means Spanish. This is the language element, which PubID supports.

https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir6867es.pdf

  10. NIST.IR.7297c - c?

Part C.

https://nvlpubs.nist.gov/nistpubs/Legacy/IR/nistir7297c.pdf

  11. NIST.IR.8115chi - chi?

Language: Chinese.

  12. NIST.IR.8115viet - viet?

Language: Vietnamese.

  13. NIST.IR.8178port - port?

Language: Portuguese.

  14. NIST.NCSTAR.1-1av1, NCSTAR.1-1cv1, NIST.NCSTAR.1-2bv1 - av, cv, bv?

https://nvlpubs.nist.gov/nistpubs/Legacy/NCSTAR/ncstar1-1av1.pdf

  15. NIST.SP.1011-I-2.0 - is 1011-I-2.0 a docnumber?

Docnumber is 1011. Volume is 1. Version is 2.0.

https://www.nist.gov/system/files/documents/el/isd/ks/NISTSP_1011-I-2-0.pdf

  16. NIST.SP.1075-NCNR - NCNR?

NCNR is the "NIST Center for Neutron Research".

This is very funny -- this is a case of a "duplicated" SP 1075!!

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication1075-NCNR.pdf

https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication1075-PML.pdf

So we need to find a way to resolve this... argh.

In this case, "1075-NCNR" is the docnumber.

Will report this to NIST.

  17. NIST.SP.800-131Ar1 - Ar?

This means Part A, Revision 1.

https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-131Ar1.pdf

  18. NIST.SP.800-28ver2 - Is ver a version? How should it be mapped to PubID?

"Version" is a supported element just like "Revision".

  19. NIST.SP.800-38a-add - add?

Addendum to SP 800-38 Part A.

  20. NIST.SP.800-57pt1r4 - pt?

Part 1.

  21. NIST.SP.801-errata - errata?

As above.

  22. NIST.SP.955.Suppl - Suppl?

Supplement.

  23. NIST.AMS.300-8r1/upd, NIST.IR.8115r1-upd - upd?

https://nvlpubs.nist.gov/nistpubs/ams/NIST.AMS.300-8r1.pdf

https://nvlpubs.nist.gov/nistpubs/ams/NIST.AMS.300-8r1-upd.pdf

"INCLUDES UPDATES AS OF 02-08-2021".

This is an "errata update". From https://github.com/metanorma/nist-pubid/blob/master/README.adoc#4-machine-readable-form , this applies:

If a superseding edition is just an errata update, we can use the update date from the title page (“includes updates as of…”) to uniquely identify this edition. Preferably use -yyyymmdd format.
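
To make a few of the answers above concrete, here is a rough sketch of splitting a DOI suffix into elements. It is not the nist-pubid implementation: the function name `parse_doi`, the output field names, and the regular expression are all assumptions. It only covers the simpler cases discussed above (pt, sec, r, ver, the four language codes, errata, index/indx, insert, upd). Supplements, lettered parts such as 800-131Ar1, volumes, and IDs like 1075-NCNR are not handled and return `None`, and date suffixes such as 100-1-1991 are not separated from the docnumber.

```python
import re

PREFIX = "10.6028/"

# Language codes and their meanings as given in the answers above.
LANGUAGES = {"es": "Spanish", "chi": "Chinese", "viet": "Vietnamese", "port": "Portuguese"}

# Hypothetical pattern for the part after "PUBLISHER.SERIES.".
PATTERN = re.compile(
    r"^(?P<number>\d+(?:-\d+)*)"
    r"(?:pt(?P<part>\d+))?"
    r"(?:sec(?P<section>\d+))?"
    r"(?:r(?P<revision>\d+))?"
    r"(?:ver(?P<version>\d+))?"
    r"(?P<lang>es|chi|viet|port)?"
    r"(?:[-/]?(?P<extra>errata|index|indx|insert|upd))?$"
)


def parse_doi(doi):
    """Split a NIST DOI into elements, or return None if it is not covered."""
    if doi.startswith(PREFIX):
        doi = doi[len(PREFIX):]
    parts = doi.split(".", 2)
    if len(parts) != 3:
        return None
    publisher, series, rest = parts
    match = PATTERN.match(rest)
    if match is None:
        return None
    fields = {k: v for k, v in match.groupdict().items() if v is not None}
    if "lang" in fields:
        fields["lang"] = LANGUAGES[fields["lang"]]
    if fields.get("extra") == "indx":
        fields["extra"] = "index"  # "indx" and "index" mean the same thing
    return {"publisher": publisher, "series": series, **fields}


for example in ("10.6028/NIST.SP.800-57pt1r4", "NIST.IR.6867es", "NBS.CIRC.488sec1"):
    print(example, parse_doi(example))
```

Returning `None` for anything the pattern does not recognise keeps unparsed IDs visible, so they can be listed and reviewed by hand rather than silently mis-mapped.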

ronaldtse commented 2 years ago

@andrew2net I've updated nist-pubid's README to reflect these element changes, please check.

UPDATE: I actually went through the full set of documents for all series (see https://github.com/metanorma/nist-pubid/issues/4), so the PubID scheme should work.

andrew2net commented 2 years ago

Let's treat them as docnumbers, yes. But did you notice these entries have assigned numbers? Then we don't need to parse the DOIs for them.

@ronaldtse I've tried to use the assigned numbers but some of them are duplicated. For example: NBS CIRC 46e2, NIST HB 105-1-1990, NBS HB 67suppJune1965 ...

ronaldtse commented 2 years ago

@andrew2net do you mean that NBS CIRC 46e2 has an identical assigned number with NBS CIRC 46?

andrew2net commented 2 years ago

@ronaldtse I found that NBS.CIRC.36e2 and NBS.CIRC.46e2 both have the item number NBS CIRC 46e2, which looks like a mistake.

UPDATE: Here are all duplicates:

["NBS CIRC 46e2",
 "NIST HB 105-1-1990",
 "NBS HB 67suppJune1965",
 "NIST IR 89-4220",
 "NBS TN 789-1",
 "NIST HB 150-10",
 "NIST IR 8115",
 "NIST IR 8117",
 "NIST IR 8119",
 "NIST IR 8178",
 "NIST TN 1648"]
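
One way such a list could be produced is sketched below, assuming each record has already been reduced to a (DOI, item number) pair. The function name `duplicated_item_numbers` is hypothetical, and the sample uses only the two conflicting DOIs named in this comment plus one unrelated record.

```python
from collections import defaultdict


def duplicated_item_numbers(records):
    """Map each item number used by more than one DOI to the DOIs that share it.

    `records` is an iterable of (doi, item_number) pairs.
    """
    by_number = defaultdict(list)
    for doi, item_number in records:
        by_number[item_number].append(doi)
    return {number: dois for number, dois in by_number.items() if len(dois) > 1}


records = [
    ("10.6028/NBS.CIRC.36e2", "NBS CIRC 46e2"),
    ("10.6028/NBS.CIRC.46e2", "NBS CIRC 46e2"),
    ("10.6028/NBS.BH.1", "NBS BH 1"),
]
print(duplicated_item_numbers(records))
```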
ronaldtse commented 2 years ago

@andrew2net in this case can you create an issue at nist-pubid about that mistake? Thanks.

andrew2net commented 2 years ago

@ronaldtse These references (NBS.CIRC.sup, NBS.CIRC.supJun1925-Jun1926, NBS.CIRC.supJun1925-Jun1927) don't have a docnumber. Is it possible to have a PubID without a docnumber? Another question: how should we handle the two dates in the last two references?

UPDATE There are also references like NBS.RPT.Apr-Jun1948.

ronaldtse commented 2 years ago

@andrew2net I've moved your last comment to a new issue. Let's not stack up the requests in this issue 😉

andrew2net commented 2 years ago

@ronaldtse there are DOIs that include a language, and the documents with those DOIs have translated titles. It seems PubID doesn't support languages; instead, our data model has a language attribute on titles. So do we need to collect all the title translations into one document? Chinese documents don't have translated titles. However, the Chinese documents (and other non-English documents) have links to translated PDF files, but our data model has no language attribute for TypedUri. Do we need to collect all these links? Maybe we need to add a language attribute to the TypedUri element. What do you think?

ronaldtse commented 2 years ago

@andrew2net we do not need to parse the set perfectly right now.

Let’s make sure we have most done and then file additional issues. Relationships between translated documents are not important right now.

We are in a hurry to have the first cut.

andrew2net commented 2 years ago
  • documents from the NIST CSRC (NIST SP 800, etc), should still come from the NIST Metanorma endpoint (which is much richer in information and updated daily)

@ronaldtse now we have 3 sources for NIST documents:

  1. https://csrc.nist.gov/CSRC/media/feeds/metanorma/pubs-export.zip
  2. https://csrc.nist.gov/search
  3. https://raw.githubusercontent.com/usnistgov/NIST-Tech-Pubs/nist-pages/xml/allrecords.xml

Is there a way to detect which source should be used for a given reference?

ronaldtse commented 2 years ago

We will only use 1 and 3 from now on. They will already represent the full information of all NIST publications. For a reference we will prioritize the information of 1 over 3.
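
A minimal sketch of that priority rule, assuming both sources have already been loaded into dictionaries keyed by document identifier. The function name `merge_sources`, the keys, and the placeholder record values are illustrative only.

```python
def merge_sources(csrc_records, tech_pubs_records):
    """Combine both sources; a CSRC record (source 1) wins over NIST-Tech-Pubs (source 3)."""
    merged = dict(tech_pubs_records)  # start from source 3
    merged.update(csrc_records)       # source 1 overrides on identifier clashes
    return merged


# Placeholder records: only the "source" marker matters for the illustration.
tech_pubs = {
    "NIST SP 800-57pt1r4": {"source": "tech-pubs"},
    "NBS BH 1": {"source": "tech-pubs"},
}
csrc = {"NIST SP 800-57pt1r4": {"source": "csrc"}}

merged = merge_sources(csrc, tech_pubs)
print(merged["NIST SP 800-57pt1r4"]["source"])
print(merged["NBS BH 1"]["source"])
```

This only works if both sources are keyed by the same identifier form, which is why the DOI-to-PubID mapping discussed earlier has to be settled first.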

andrew2net commented 2 years ago

@ronaldtse it seems that 1 and 3 don't cover everything. For example, SP 800-55 Rev. 2 (Draft) is only in https://csrc.nist.gov/search.

ronaldtse commented 2 years ago

@andrew2net interesting! In this case we should consider this a bug in 1. The results from 1 and 2 are supposed to be identical. I will report and revert.

ronaldtse commented 2 years ago

In any case, we will migrate to a full-data approach with NIST instead of using dynamic scraping. Please help proceed.

ronaldtse commented 2 years ago

The results from 1 and 2 are supposed to be identical. I will report and revert.

NIST CSRC responded that endpoint 1 is now fixed. Thanks guys!