metanorma / pubid-ieee

PubID spec and implementation for IEEE deliverables
BSD 2-Clause "Simplified" License

Fully parse all IEEE `normtitle` entries #4

Closed ronaldtse closed 2 years ago

ronaldtse commented 2 years ago

Full PubIDs from IEEE: pubid-sorted.txt.zip

$ wc -l pubid-sorted.txt
    9436 pubid-sorted.txt

Please also look at #2 and #3 for resolved details.

Method of generating this list:

  1. Extract all XML files from ieee-rawbib2
  2. Run the following commands:
find . -name '*.xml' -exec bash -c 'xmllint --xpath "//publication/normtitle/text()" --nocdata "$0" >> pubid.txt; echo >> pubid.txt' {} \;
sort pubid.txt | uniq > pubid-sorted.txt
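
The de-duplication step can also be mirrored in Ruby. This is only a sketch: `unique_titles` is a hypothetical helper name, and it assumes one `normtitle` per line in pubid.txt. Unlike the shell pipeline, it drops the blank lines that the `echo` step leaves behind for files without a `normtitle`, so its count can be one lower than the `wc -l` figure. Sort order may also differ from `sort` under some locales.

```ruby
# Mirror of `sort pubid.txt | uniq`, except that blank lines are dropped.
# Assumption: each input line holds one normtitle, as produced by the command above.
def unique_titles(lines)
  lines.map(&:strip).reject(&:empty?).uniq.sort
end

# Sample lines for illustration only.
raw = [
  "IEEE Std 1073.1.1.1-2004\n",
  "\n",
  "IEC/IEEE 63113:2021\n",
  "IEEE Std 1073.1.1.1-2004\n",
]
titles = unique_titles(raw)
puts titles.length
puts titles
```

With the real file this would be called as `unique_titles(File.readlines("pubid.txt"))`.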

Some observed rules (https://github.com/metanorma/pubid-ieee/issues/2#issuecomment-951128062):

Joint publications:

mico commented 2 years ago

@ronaldtse do you have any references to standards or IEEE PubID parsing implementations that could help?

mico commented 2 years ago

@ronaldtse what do we want to do with the parsed PubIDs? Do we need to convert them back to PubIDs, or to other formats?

ronaldtse commented 2 years ago

Code that is now used for PubID parsing is here: https://github.com/relaton/relaton-ieee/blob/main/lib/relaton_ieee/rawbib_id_parser.rb

ronaldtse commented 2 years ago

The source files for these entries are at https://github.com/relaton/ieee-rawbib.

There are a few problems:

  1. In these XML files, there are a number of identical normtitle entries even though they have different filenames. We need to find out how to distinguish these bibliographic entries and then report back to IEEE.
  2. Some of these entries are not properly associated, e.g. the <standard_id> value is 0.
  3. Some of these entries are draft entries of each other. As part of Relaton, we need to find out how the entries are related and re-build the full graph.
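
As a starting point for the first problem, identical `normtitle` values can be grouped by the files they come from. This is a minimal sketch, assuming the (file, normtitle) pairs have already been extracted; the `shared_titles` helper and the file names are hypothetical.

```ruby
# Group (file, normtitle) pairs and keep only titles shared by several files.
# Assumption: the pairs were extracted from the XML files beforehand.
def shared_titles(entries)
  entries
    .group_by { |entry| entry[:normtitle] }
    .transform_values { |group| group.map { |entry| entry[:file] }.sort }
    .select { |_title, files| files.length > 1 }
end

# Hypothetical file names, for illustration only.
entries = [
  { file: "a.xml", normtitle: "IEEE Std 1073.1.1.1-2004" },
  { file: "b.xml", normtitle: "IEEE Std 1073.1.1.1-2004" },
  { file: "c.xml", normtitle: "IEC/IEEE 63113:2021" },
]
shared_titles(entries).each do |title, files|
  puts "#{title}: #{files.join(', ')}"
end
```

The resulting hash lists each ambiguous title with its files, which is the material that would need to be reported back to IEEE.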

ronaldtse commented 2 years ago

Regarding pubid, notice that there are multiple types of IEEE PubIDs, and also some jointly-published ones with ISO PubIDs. Since we now have an ISO PubID implementation, it will help us here.

mico commented 2 years ago

@ronaldtse could you tell me what problem we are trying to solve here? Do we want to convert to another format, distinguish these bibliographic entries from "ieee-rawbib" and build a relations graph, or something else?

ronaldtse commented 2 years ago

Right now, Relaton-IEEE is unable to parse all IEEE PubID entries because it parses them with regular expressions. This has the following consequences:

  1. We are unable to convert all ieee-rawbib data into https://github.com/ietf-ribose/relaton-data-ieee . Around 10% of entries are now missing, so people cannot cite from the full library. (see this: https://github.com/ietf-ribose/bibxml-service/issues/136#issuecomment-1047138249 and https://github.com/relaton/relaton-ieee/issues/16)
  2. Some entries in Relaton-IEEE are parsed wrongly. This means that people end up citing the wrong document. See this for example: https://github.com/ietf-ribose/bibxml-service/issues/31#issuecomment-1030171866

i.e. we must properly parse IEEE PubIDs in order to make the full IEEE dataset available for citation.
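
The size of the gap can be measured by running any parser over the full identifier list. The sketch below is an illustration only: `coverage` is a hypothetical helper, and `toy_parser` is a deliberately narrow stand-in pattern, not the regular expression used in Relaton-IEEE.

```ruby
# Split identifiers into parsed and unparsed using any callable parser.
# Assumption: the parser returns nil (or raises) when it cannot parse an identifier.
def coverage(identifiers, parser)
  parsed, unparsed = identifiers.partition do |id|
    begin
      !parser.call(id).nil?
    rescue StandardError
      false
    end
  end
  {
    parsed: parsed.length,
    unparsed: unparsed,
    ratio: parsed.length.fdiv(identifiers.length),
  }
end

# Narrow stand-in pattern for illustration, not the Relaton-IEEE one.
toy_parser = lambda do |id|
  id.match(/\AIEEE (?:Std )?(?<number>[\d.]+)-(?<year>\d{4})\z/)
end

ids = [
  "IEEE Std 1073.1.1.1-2004",
  "IEC/IEEE 63113:2021",
  "IEEE 802.15.22.3-2020",
  "ISO/IEC FDIS P15289, April 2014(E)",
]
report = coverage(ids, toy_parser)
puts report[:ratio]
puts report[:unparsed]
```

Listing the unparsed identifiers, rather than only counting them, shows which patterns still need grammar rules.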

mico commented 2 years ago

Will we use it (pubid-ieee) to replace https://github.com/relaton/relaton-ieee/blob/main/lib/relaton_ieee/rawbib_id_parser.rb ?

ronaldtse commented 2 years ago

> Will we use it (pubid-ieee) to replace https://github.com/relaton/relaton-ieee/blob/main/lib/relaton_ieee/rawbib_id_parser.rb ?

Yes.

mico commented 2 years ago

@ronaldtse should we use pubid-iso to parse identifiers like:

IEC/IEEE 62704-1:2017
IEC/IEEE 62704-2:2017
IEC/IEEE 62704-3:2017
IEC/IEEE 62704-4:2020
IEC/IEEE 63113:2021
IEC/IEEE 63260:2020
IEC/ISO/IEEE 80005-1:2012
ISO/IEC FDIS P15289, April 2014(E)

?

ronaldtse commented 2 years ago

The ones that start with ISO, yes. The rest are IEC identifiers. IEC PubIDs are similar to ISO's, but they have different stages and allow a subpart (e.g. IEC 1000-1-2), so we need a pubid-iec.
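
The rule above amounts to dispatching on the leading publisher of the identifier. A minimal sketch, assuming that the first publisher decides the parser family; `parser_family` and the returned symbols are placeholders, not actual gem APIs.

```ruby
# Choose a parser family from the leading publisher of an identifier.
# Assumption: the first publisher decides; the symbols are placeholders, not gem APIs.
def parser_family(identifier)
  publishers = identifier[%r{\A[A-Z]+(?:/[A-Z]+)*}].to_s.split("/")
  case publishers.first
  when "ISO" then :pubid_iso
  when "IEC" then :pubid_iec
  else :pubid_ieee
  end
end

[
  "IEC/IEEE 62704-1:2017",
  "IEC/ISO/IEEE 80005-1:2012",
  "ISO/IEC FDIS P15289, April 2014(E)",
  "IEEE Std 1073.1.1.1-2004",
].each { |id| puts "#{id} -> #{parser_family(id)}" }
```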

mico commented 2 years ago

IEEE Std 1073.1.1.1-2004 (https://standards.ieee.org/ieee/1073.1.1.1/1571/)

"Replaced by ISO/IEEE 11073-10101-2004"

Example of similar identifier: P11073-10101c (https://standards.ieee.org/ieee/11073-10101c/10476/) Title: "Standard for Health informatics--Point-of-care medical device communication - Part 10101: Nomenclature Amendment 3: Additional definitions".

@ronaldtse I believe IEEE Std 1073.1.1.1-2004 should be "IEEE 1073-10101-2004" or "IEEE 11073-10101-2004", what do you think?

ronaldtse commented 2 years ago

> I believe IEEE Std 1073.1.1.1-2004 should be "IEEE 1073-10101-2004" or "IEEE 11073-10101-2004", what do you think?

No, we have to keep the original identifier. Its replacement "ISO/IEEE 11073-10101-2004" probably intentionally selected the 10101 part to keep identity with 1.1.1. Notice that 1073 became 11073 because ISO 1073 is already taken by another standard. This is causality in reverse.

"P11073-10101c" is the draft designation of "11073-10101c": in IEEE numbering the "P" prefix marks a project, i.e. a standard that is still a draft. The "c" character means it is the 3rd Amendment to "11073-10101". According to the website, "P11073-10101c" was done in 2020, so it is a "draft amendment".

i.e. historically:

  1. IEEE 1073.1.1.1-2004 was published
  2. ISO/IEEE 11073-10101-2004 was published
  3. IEEE P11073-10101c is a draft amendment of ISO/IEEE 11073-10101-2004 (the ieee-rawbib data directly indicates which standard supersedes which)
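
The reading of "P11073-10101c" described above can be sketched as a small parser. This is an illustration under stated assumptions: a leading "P" marks a draft, one trailing lowercase letter marks an amendment counted from "a", and the input carries no publisher prefix or year. `DESIGNATION` and `parse_designation` are hypothetical names.

```ruby
# Split an IEEE designation into draft flag, base number and amendment ordinal.
# Assumptions: leading "P" = draft; one trailing lowercase letter = amendment
# (a = 1st, b = 2nd, c = 3rd); no publisher prefix or year in the input.
DESIGNATION = /\A(?<draft>P)?(?<number>\d+(?:[.-]\d+)*)(?<letter>[a-z])?\z/

def parse_designation(text)
  m = DESIGNATION.match(text)
  return nil unless m

  {
    draft: !m[:draft].nil?,
    number: m[:number],
    amendment: m[:letter] && (m[:letter].ord - "a".ord + 1),
  }
end

p parse_designation("P11073-10101c")
p parse_designation("1073.1.1.1")
```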

mico commented 2 years ago

@ronaldtse "IEEE 802.15.22.3-2020" - how can I know what is 22 and 3 here?

mico commented 2 years ago

> > I believe IEEE Std 1073.1.1.1-2004 should be "IEEE 1073-10101-2004" or "IEEE 11073-10101-2004", what do you think?
>
> No, we have to keep the original identifier. Its replacement "ISO/IEEE 11073-10101-2004" probably intentionally selected the 10101 part to keep identity with 1.1.1. Notice that 1073 became 11073 because ISO 1073 is already taken by another standard. This is causality in reverse.

I'm trying to work out how I should treat these numbers. I had an idea to parse them as {number}.{part}.{subpart}, but some identifiers have more than three numbers. Maybe I can parse the extra numbers as extra subparts.

ronaldtse commented 2 years ago

"IEEE 802.15.22.3-2020" "IEEE Standard for Spectrum Characterization and Occupancy Sensing":

You can see that "22.3" is called the "Part" in the draft.

[Screenshot, 2022-03-15 11:08 AM]

ronaldtse commented 2 years ago

> I had an idea to parse it as {number}.{part}.{subpart} but there are over 3 numbers.

I am not sure whether there is a proper structure in IEEE identifiers. Some patterns are somewhat arbitrary (e.g. there exists 802.15.22.3 but not 802.15.22.1 or 802.15.22.2).

This is a topic we will need to investigate and analyse.
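
Until that analysis is done, one cautious option is to keep the dotted numbers as an ordered list without deciding which one is a part or a subpart. The sketch below does only that; `NUMBER_WITH_YEAR` and `split_components` are hypothetical names, and it handles only dotted numbers with an optional four-digit year, not hyphenated ones such as 11073-10101.

```ruby
# Split the numeric portion into ordered components plus an optional year,
# without assigning part/subpart meaning to any component.
# Assumption: optional "IEEE " and "Std " prefixes, dotted digits, optional -YYYY.
NUMBER_WITH_YEAR = /\A(?:IEEE )?(?:Std )?(?<dotted>\d+(?:\.\d+)*)(?:-(?<year>\d{4}))?\z/

def split_components(identifier)
  m = NUMBER_WITH_YEAR.match(identifier)
  return nil unless m

  # Components stay strings so that any leading zeros are preserved.
  { components: m[:dotted].split("."), year: m[:year]&.to_i }
end

p split_components("IEEE 802.15.22.3-2020")
p split_components("IEEE Std 1073.1.1.1-2004")
```

Keeping the components uninterpreted means the original identifier can always be rebuilt by joining them with dots, whatever structure later turns out to apply.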

mico commented 2 years ago

@ronaldtse I believe we are finished with this issue.

ronaldtse commented 2 years ago

@mico we have 886 identifiers that are not yet being parsed, but I will make that into a new issue.