metanorma / pubid-ieee

PubID spec and implementation for IEEE deliverables
BSD 2-Clause "Simplified" License

Parse IEEE draft documents #11

Open ronaldtse opened 2 years ago

ronaldtse commented 2 years ago

IEEE draft documents can have the following patterns:

ANSI PC63.10/D14, April 2020
ANSI PC63.12/D12e, January 2015
IEEE 1250 /D11 May 2010
IEEE Std PC37.20.1b/D2
IEEE Std PC37.12.1/D2.0
IEEE Draft P802.11-REVmb/D3.0, March 2010 (Revision of IEEE Std 802.11-2007, as amended by IEEE Std 802.11k-2008, IEEE Std 802.11r-2008, IEEE Std 802.11y-2008, IEEE Std 802.11w-2009 and IEEE Std 802.11n-2009)
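
A minimal sketch of how the project number and the draft label could be separated for the patterns above, assuming a single regular expression is good enough for a first pass. `DRAFT_RE` and `split_draft` are hypothetical names (not part of pubid-ieee), and Python is used purely for illustration:

```python
import re

# Hypothetical first-pass pattern: an optional "P"/"PC"/"C" prefix, a number that
# may contain dots, letters, and hyphenated suffixes such as "-REVmb", then
# "/D<draft>". Whitespace before the slash is tolerated ("IEEE 1250 /D11").
DRAFT_RE = re.compile(r"(?P<number>P?C?\d[\w.\-]*)\s*/D(?P<draft>[\w.]+)")


def split_draft(identifier):
    """Return (number, draft) for a draft identifier, or None if there is no draft part."""
    match = DRAFT_RE.search(identifier)
    if match is None:
        return None
    return match.group("number"), match.group("draft")
```

Publisher prefixes ("ANSI", "IEEE Std", "IEEE Draft") and trailing dates are deliberately ignored here; they would need their own rules.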

mico commented 2 years ago

@ronaldtse there are documents with "r" in the identifier, e.g. "IEEE P11073-10101/D3r7, September 2018". Is it a document revision or a draft revision?

ronaldtse commented 2 years ago

@mico I believe D3r7 means "3rd draft, revision 7". The concept of "revision" in the IEEE PubID does not seem to be in common use, but I guess there are 11 instances for 2 documents in the entire library!

IEEE P11073-10101/D3r7, September 2018
IEEE P11073-10101/D4r1, January 2019
IEEE P11073-10101/D5r4, February 2019
IEEE P11073-10101/D7r1, March 2019
IEEE P11073-10101/D9r1, April 2019
IEEE P11073-10471/D2r2, April 2020
IEEE P11073-10471/D3r2, November 2021
IEEE P11073-110101/D8r1, April 2019
IEEE P1242/D8r2, June 2016
IEEE P1242/D8r3, July 2016

I think these two patterns are identical in intention:

There are a lot more of the "dot notation" -- 1439 of them. Let's treat them as the same, and use the "dot notation" as the output format.
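
A small sketch of that normalization, assuming the two patterns being compared are the "r" notation (D3r7) and the "dot notation" (D3.7). `to_dot_notation` is a hypothetical helper name, and Python is used only for illustration:

```python
import re

# Assumption: "D<draft>r<revision>" and "D<draft>.<revision>" mean the same thing.
DRAFT_REVISION_RE = re.compile(r"/D(\d+)r(\d+)")


def to_dot_notation(identifier):
    """Rewrite "/D3r7" as "/D3.7"; identifiers without an "r" revision are returned unchanged."""
    return DRAFT_REVISION_RE.sub(r"/D\1.\2", identifier)
```

For example, `to_dot_notation("IEEE P1242/D8r3, July 2016")` gives `"IEEE P1242/D8.3, July 2016"`.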

mico commented 2 years ago

@ronaldtse "IEEE P1609.2.1/D12D14": what is D14 here? Is it another draft? I tried to find out myself but couldn't find this document with "D12D14".

ronaldtse commented 2 years ago

Looking at the examples:

IEEE P1609.2.1/D10, February 2020
IEEE P1609.2.1/D12, June 2020
IEEE P1609.2.1/D12D14, June 2020
IEEE P1609.2.1/D15, August 2020
IEEE P1609.2.1/D4, November 2021
IEEE P1609.2.1/D6, January 2022

I think this is a typo for D14. Let's make this a single time replacement (we should have a set of special cases to replace these errors) so it is not in the parse rules.

mico commented 2 years ago

I think this is a typo for D14. Let's make this a single time replacement (we should have a set of special cases to replace these errors) so it is not in the parse rules.

There are 7 cases like this:

IEEE P11073-10420/D4D5, March 2020
IEEE P1609.2.1/D12D14, June 2020
IEEE P1653.5/D7d1 November, 2019
IEEE P3002.2/D6D7, April 2017
IEEE P515/D4D5, March 2017
IEEE PC57.143 /D24D25, October 2012
IEEE Unapproved Draft Std P1680/D4D6, Aug 2009

ronaldtse commented 2 years ago

You're right. This is clearly intentional.

When I see these:

P3002.2/D6, Oct 2015
P3002.2/D6D7, Apr 2017
P3002.2/D7, Sept 2017

I think this means it is a "pre-D7 coming from D6".

Given that there are:

P1680/D4D6, Aug 2009
P1609.2.1/D12D14
P352/D4D6, Feb 2016

The pre-draft and intended target draft numbers are not consecutive.

So we have to parse these two numbers separately.
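
A rough sketch of parsing the two numbers separately. `DOUBLE_DRAFT_RE` and `parse_double_draft` are hypothetical names, and the interpretation of the two numbers as "from" and "target" drafts is only the reading proposed above:

```python
import re

# Hypothetical pattern for labels such as "D6D7" or "D12D14".
# The second letter is matched in either case because "D7d1" also occurs in
# the data; whether "d1" means the same thing there is an open assumption.
DOUBLE_DRAFT_RE = re.compile(r"/D(\d+)[Dd](\d+)")


def parse_double_draft(identifier):
    """Return (first_number, second_number) as ints, or None for ordinary drafts."""
    match = DOUBLE_DRAFT_RE.search(identifier)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Keeping both numbers as separate values means non-consecutive pairs such as D4D6 need no special handling.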

mico commented 2 years ago

@ronaldtse should I ignore "REV" here or just leave it as part of the number?

IEEE P802-REV/D1.7
IEEE P802-REV/D1.9

ronaldtse commented 2 years ago

@mico leave REV as part of the number.

Look at these two entries, the "REV" is part of the number (the first one in a superseding relationship, the second in a title):

"P802.11-REVma/D5.0 (Superseded by P802.11-REVma_D6.0)"
"P802.11ai/D11.0 Sept 2016 - IEEE Approved Draft Standard for Information technology-Telecommunications and information exchange between systems-Local and metropolitan area networks-Specific requirements Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Amendment to IEEE P802.11-REVmc(TM)/D8.0: Fast Initial Link Setup"

mico commented 2 years ago

@ronaldtse What should we do with documents like these:

IEEE Unapproved Draft Std 11073-10471/D02, Feb 2008
IEEE Unapproved Draft Std 11073-10472/D02, Apr 2009
IEEE Active Unapproved Draft Std PC37.59/D11, Jul 2007
IEEE Active Unapproved Draft Std PC57.129/D10, Jul 2007
IEEE Active Unapproved Draft Std PC62.21/D2, Jul 2007
IEEE Approved Draft Std C57.12.35/D7, 07
IEEE Approved Draft Std P1076.1/D3.3, Feb 6, 2007
IEEE Approved Draft Std P11073-10415/D11, Aug 2008

Do we need to keep "Approved Draft", "Active Unapproved Draft", "Unapproved Draft" in the resulting PubID?

ronaldtse commented 2 years ago

@mico I believe we should keep these statuses as part of the PubID.

mico commented 2 years ago

@ronaldtse there are many identifiers with the word "Unapproved" but without "Draft" after it. Should we add "Draft" to the output? Should we add "Draft" to every draft document? E.g.:

IEEE Unapproved Std P277/D2,Mar 2007
IEEE Unapproved Std P487/D7 Feb 2007
IEEE Unapproved Std P495/D12 Mar 2007
IEEE Unapproved Std P802.16g/D8, Feb2007
IEEE Unapproved Std P802.1ag/D8, Feb 2007

ronaldtse commented 2 years ago

I did a search for "IEEE Unapproved Std P277/D2,Mar 2007" and got this: https://ieeexplore.ieee.org/document/4152680

The full title is apparently: "P277/D2,Mar 2007 - Unapproved Draft IEEE Recommended Practice for Cement Plant Power Distribution"

In fact, the original raw XML data for this entry is this:

<publication>
  <title><![CDATA[IEEE Unapproved Std P277/D2,Mar 2007]]></title>
  <normtitle><![CDATA[IEEE Unapproved Std P277/D2,Mar 2007]]></normtitle>
  <standardsfamilytitle>IEEE Recommended Practice for Cement Plant Power Distribution</standardsfamilytitle>
  <publicationinfo>
    <idamsid>0b000064807b4661</idamsid>
    <stdnumber>P277/D2,Mar 2007</stdnumber>
    <publicationtype>Standard</publicationtype>
    <publicationsubtype>Standard Docs</publicationsubtype>
    <standard_subtype>IEEE Standard</standard_subtype>
    <ieeeabbrev>IEEESTD</ieeeabbrev>
    <pubstatus>Active</pubstatus>
    <publicationopenaccess>F</publicationopenaccess>
    <standard_id>0</standard_id>
    <standard_status>Inactive</standard_status>
    <standardmodifierset>
      <standard_modifier>Draft</standard_modifier>
    </standardmodifierset>
    <packagememberset>
      <packagemember>STDSELECT</packagemember>
    </packagememberset>
    <standard_family>277</standard_family>
    <standardpackageset>
      <standard_package>3000 Standards Collection for Industrial and Commercial Power Systems</standard_package>
    </standardpackageset>
    <icscodes>
      <code_term codenum="91.100.10">Cement. Gypsum. Lime. Mortar</code_term>
    </icscodes>
    <pubtopicalbrowseset>
      <pubtopicalbrowse>Power, Energy and Industry Applications</pubtopicalbrowse>
    </pubtopicalbrowseset>
    <copyrightgroup>
      <copyright>
        <year>2007</year>
        <holder>IEEE</holder>
      </copyright>
    </copyrightgroup>
    <publisher>
      <publishername>IEEE</publishername>
      <address>
        <country>USA</country>
      </address>
    </publisher>
    <holdstatus>Hold</holdstatus>
    <confgroup>
      <doi_permission>F</doi_permission>
    </confgroup>
    <amsid>4152678</amsid>
  </publicationinfo>
  <volume>
    <article>
      <title><![CDATA[Unapproved Draft IEEE Recommended Practice for Cement Plant Power Distribution]]></title>
      <articleinfo>
        <articleseqnum>1</articleseqnum>
        <idamsid>0b000064807b4665</idamsid>
        <articlestatus>Active</articlestatus>
        <articleopenaccess>F</articleopenaccess>
        <articleshowflag>F</articleshowflag>
        <articleplagiarizedflag>F</articleplagiarizedflag>
        <articlenodoiflag>F</articlenodoiflag>
        <articlecoverimageflag>F</articlecoverimageflag>
        <articlereferenceflag>F</articlereferenceflag>
        <articlepeerreviewflag>F</articlepeerreviewflag>
        <holdstatus>Publish</holdstatus>
        <articlecopyright holderisieee="Yes" year="0"/>
        <date datetype="OriginalPub">
          <year>2007</year>
        </date>
        <size>330403</size>
        <filename docpartition="5" filetype="MainPDF">04152680.pdf</filename>
        <artpagenums endpage="" startpage=""/>
        <amsid>4152680</amsid>
      </articleinfo>
    </article>
  </volume>
</publication>

If you look at the <stdnumber>, the "Unapproved..." text is not present.

But look at the discrepancy between the title values of <publication> vs the <article>:

<publication>
  <normtitle><![CDATA[IEEE Unapproved Std P277/D2,Mar 2007]]></normtitle>
</publication>
<!--vs-->
<volume>
  <article>
    <title><![CDATA[Unapproved Draft IEEE Recommended Practice for Cement Plant Power Distribution]]></title>
  </article>
</volume>

Interestingly, the publication says "Unapproved Std" but the article says "Unapproved Draft".

The "Unapproved" part is not recorded in any structured field of the XML; it appears only in the title strings.
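
To make the comparison concrete, here is a minimal sketch that pulls the relevant fields out of a trimmed copy of the record with Python's standard `xml.etree.ElementTree`. The `SAMPLE` string and the `summarize` helper are illustrative only, and the nesting of `<volume>` inside `<publication>` is an assumption based on the excerpt:

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the record above; the real file has many more fields.
# Assumption: <volume> is nested inside <publication>.
SAMPLE = """<publication>
  <normtitle><![CDATA[IEEE Unapproved Std P277/D2,Mar 2007]]></normtitle>
  <publicationinfo>
    <stdnumber>P277/D2,Mar 2007</stdnumber>
    <standardmodifierset>
      <standard_modifier>Draft</standard_modifier>
    </standardmodifierset>
  </publicationinfo>
  <volume>
    <article>
      <title><![CDATA[Unapproved Draft IEEE Recommended Practice for Cement Plant Power Distribution]]></title>
    </article>
  </volume>
</publication>"""


def summarize(xml_text):
    """Collect the fields that mention (or omit) the draft and approval status."""
    root = ET.fromstring(xml_text)
    return {
        "normtitle": root.findtext("normtitle"),
        "stdnumber": root.findtext("publicationinfo/stdnumber"),
        "modifiers": [m.text for m in root.iter("standard_modifier")],
        "article_title": root.findtext("volume/article/title"),
    }
```

On this sample, only the two title strings contain "Unapproved"; `stdnumber` and the modifier list do not.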

I found the following two files, both for P1076.1/D3.3 (attached as Archive.zip).

Notice this diff:

<   <title><![CDATA[IEEE Unapproved Std P1076.1/D3.3, Feb2007]]></title>
<   <normtitle><![CDATA[IEEE Unapproved Std P1076.1/D3.3, Feb2007]]></normtitle>
---
>   <title><![CDATA[IEEE Approved Draft Std P1076.1/D3.3, Feb 6, 2007]]></title>
>   <normtitle><![CDATA[IEEE Approved Draft Std P1076.1/D3.3, Feb 6, 2007]]></normtitle>
8,9c8,9
<     <idamsid>0b000064807b466f</idamsid>
<     <stdnumber>P1076.1/D3.3, Feb2007</stdnumber>
---
>     <idamsid>0b000064808ffb04</idamsid>
>     <stdnumber>P1076.1/D3.3, Feb 6, 2007</stdnumber>
24c24
<     <isbn isbntype="New-2005" mediatype="Electronic">978-1-5044-2834-7</isbn>
---
>     <isbn isbntype="New-2005" mediatype="Electronic">978-1-5044-2833-0</isbn>
52c52
<     <amsid>4152684</amsid>
---
>     <amsid>4278971</amsid>
57c57
<       <idamsid>0b000064820da92c</idamsid>
---
>       <idamsid>0b000064820daaee</idamsid>
59c59
<         <amsid>4152685</amsid>
---
>         <amsid>4278972</amsid>
64c64
<       <title><![CDATA[Unapproved IEEE Draft Standard VHDL Analog and Mixed-Signal Extensions (Revision of IEEE Std 1076.1-1999)]]></title>
---
>       <title><![CDATA[Approved IEEE Draft Standard VHDL Analog and Mixed-Signal Extensions (Revision of IEEE Std 1076.1-1999)]]></title>
67c67
<         <idamsid>0b000064807b4673</idamsid>
---
>         <idamsid>0b000064808ffb08</idamsid>
81,82c81,82
<         <size>6226635</size>
<         <filename docpartition="5" filetype="MainPDF">04152686.pdf</filename>
---
>         <size>6215274</size>
>         <filename docpartition="5" filetype="MainPDF">04278973.pdf</filename>
84c84
<         <amsid>4152686</amsid>
---
>         <amsid>4278973</amsid>

There is no difference in any of the statuses.

This tells me that the status of "Approved vs Unapproved" is not encoded in the XML data, and is only available in the PubID. Maybe we should store the parsed status of "Approved" and "Unapproved" and display it only in the "full PubID style".

mico commented 2 years ago

This tells me that the status of "Approved vs Unapproved" is not encoded in the XML data, and is only available in the PubID. Maybe we should store the parsed status of "Approved" and "Unapproved" and display it only in the "full PubID style".

What about the word "Draft"? Should we display it only for "full PubID style" as well? And for every draft document? (documents with /D suffix)

ronaldtse commented 2 years ago

@mico I think you are right:

  • Should we display it only for "full PubID style" as well? I think so.

  • And for every draft document? (documents with /D suffix) Yes.

For PubIDs that do not have one of these statements:

  • "Unapproved Draft"

  • "Unapproved Draft Std"

  • "Active Unapproved Draft Std"

  • "Approved Draft"

  • "Approved Draft Std"

we only know that it is a "Draft"; we do not know whether it is "Approved" or "Unapproved", or whether it is "Active".
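
A sketch of what "store the parsed status, show it only in the full style" could look like. Everything here is hypothetical: `DraftId`, `parse`, and `to_s` are made-up names, the layout of both output styles is an assumption, and the input is assumed to be a draft identifier starting with "IEEE":

```python
import re
from dataclasses import dataclass
from typing import Optional

# Optional "Active", optional "Approved"/"Unapproved", optional "Draft", optional "Std".
PREFIX_RE = re.compile(
    r"^IEEE\s+(?P<active>Active\s+)?(?P<approval>Approved|Unapproved)?\s*"
    r"(?P<draft>Draft\s+)?(?P<std>Std\s+)?(?P<rest>\S.*)$"
)


@dataclass
class DraftId:
    rest: str                # number, draft, and date, kept verbatim
    approval: Optional[str]  # "Approved", "Unapproved", or None when unknown
    active: bool
    std: bool

    def to_s(self, full=False):
        """Short style omits the status words; full style always adds "Draft"."""
        if not full:
            return f"IEEE {self.rest}"
        words = ["IEEE"]
        if self.active:
            words.append("Active")
        if self.approval:
            words.append(self.approval)
        words.append("Draft")
        if self.std:
            words.append("Std")
        return " ".join(words + [self.rest])


def parse(identifier):
    match = PREFIX_RE.match(identifier)
    if match is None:
        raise ValueError(f"not an IEEE identifier: {identifier!r}")
    return DraftId(
        rest=match.group("rest"),
        approval=match.group("approval"),
        active=match.group("active") is not None,
        std=match.group("std") is not None,
    )
```

With this shape, an identifier without any status words still renders as "IEEE Draft ..." in the full style, while `approval` stays `None` to record that the approval status is unknown.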

mico commented 2 years ago

Found a new pattern to parse: "/D\d+\+\d+" (a draft number, a literal "+", and a second number):

IEEE 1647/D8+3, December 2010
IEEE P1031/D1+1, August 2010
IEEE P463/D1+1, May 2013
IEEE P751/D2+1, May 2018
P1857/D1+1, July 2012
P2745.1/D4+1, April 2019
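
A minimal sketch for capturing both numbers without deciding yet what the second one means. `PLUS_DRAFT_RE` and `parse_plus_draft` are hypothetical names, and Python is used only for illustration:

```python
import re

# "/D<number>+<number>"; the "+" is escaped because it is a regex quantifier.
PLUS_DRAFT_RE = re.compile(r"/D(\d+)\+(\d+)")


def parse_plus_draft(identifier):
    """Return both numbers as ints without interpreting the second one."""
    match = PLUS_DRAFT_RE.search(identifier)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```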

ronaldtse commented 2 years ago

@mico I checked online but don't know what these mean...

There is one instance of CEI/IEC 61000-4-15:1997+A1:2003, which means it is CEI/IEC 61000-4-15:1997 with Amendment 1 (A1:2003), the + here means "combined with Amendment 1". This instance is IEC practice.

mico commented 2 years ago

@ronaldtse another identifier I don't know how to parse: "PC37.30.2/D043 Rev 18, May 2015". Is it a revision of a draft? Any ideas how I should represent it?

mico commented 2 years ago

@ronaldtse "IEEE P1680.4_D1 and NSF/ANSI 426, August 2016" should I represent it as "IEEE P1680.4-2016/D1 (NSF/ANSI 426)"? Better ideas?

Upd.: it should be "IEEE P1680.4/D1 (NSF/ANSI 426), August 2016" or "IEEE P1680.4/D1, August 2016 (NSF/ANSI 426)"

ronaldtse commented 2 years ago

Probably "IEEE P1680.4/D1, August 2016 (NSF/ANSI 426)"?

NSF is the "National Sanitation Foundation", which issues food safety and hygiene standards. They use PubIDs like "NSF/ANSI/CAN 61", "NSF/ANSI 61-2021", "NSF/ANSI 336-2011".

I think PC37.30.2/D043 Rev 18, May 2015 is a revision of a draft, yes. Maybe PC37.30.2/D43R18, May 2015. This is similar to the other patterns like:

IEEE P11073-10101/D3r7, September 2018
IEEE P11073-10101/D4r1, January 2019
IEEE P11073-10101/D5r4, February 2019
IEEE P11073-10101/D7r1, March 2019
IEEE P11073-10101/D9r1, April 2019

mico commented 1 year ago

@ronaldtse IEEE Unapproved Std PC37.101/D13, Jun 2006: as we can see, this is a draft, so is "Draft" missing from the identifier? Should we add "Draft" to the output, so the result will be IEEE Unapproved Draft Std PC37.101/D13, Jun 2006?

Update: I think something is wrong with the source data: https://github.com/metanorma/pubid-ieee/blob/main/spec/fixtures/pubid-parsed.txt#L5705

ronaldtse commented 1 year ago

@mico we just received clarification from IEEE:

If it's an "unapproved draft" it is not a standard yet, so none of them should have "Std" included. I'm sure many do though -- the working groups work on revisions based on the published version and are not necessarily aware of these subtleties.

i.e. "Unapproved Std" or "Unapproved Draft Std" should not have "Std".
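
A small sketch of applying that rule, assuming it is done as a text rewrite before parsing. `drop_std_from_unapproved` is a hypothetical helper; it only removes "Std" and does not decide whether a missing "Draft" should be added:

```python
import re

# "Unapproved Std", "Unapproved Draft Std", and "Active Unapproved Draft Std"
# all lose "Std"; "Approved Draft Std" is left alone.
UNAPPROVED_STD_RE = re.compile(r"\b(Unapproved(?:\s+Draft)?)\s+Std\b")


def drop_std_from_unapproved(identifier):
    """Remove "Std" when it follows "Unapproved" or "Unapproved Draft"."""
    return UNAPPROVED_STD_RE.sub(r"\1", identifier)
```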

mico commented 1 year ago

@mico I checked online but don't know what these mean...

There is one instance of CEI/IEC 61000-4-15:1997+A1:2003, which means it is CEI/IEC 61000-4-15:1997 with Amendment 1 (A1:2003), the + here means "combined with Amendment 1". This instance is IEC practice.

@ronaldtse are you saying here that IEEE 1647/D8+3 is a Draft 8 + Amendment 3?

mico commented 1 year ago

@ronaldtse I'm struggling with the identifier "ISO/IEC/IEEE P26513_D2, January 2017". This seems to be an ISO identifier, but it is in IEEE format and has an IEEE draft part.

There is no way to reformat it to "ISO" format without losing the "draft" part. So it seems the output should be "ISO/IEC/IEEE Draft Std P26513/D2, January 2017". Am I right?

Upd.: I found another challenging identifier: "P82079-1_D4_FDIS". It is definitely an "ISO" identifier, because of the stage, but with an IEEE draft part. Should we also have a mixed ISO/IEEE format (where we render the identifier in ISO format, but allow the draft to be rendered in IEEE format)?