Closed ronaldtse closed 2 years ago
We should parse this list instead of #2.
Especially for the dual/triple published standards, only the normtitle
provides that information.
Regular expression to parse the normtitles: https://regex101.com/r/yz98W9/4
@ronaldtse
- These 2 normtitles look identical:
EEE Std 1671.1-2017 (Revision of IEEE Std 1671.1‐2009)
EEE Std 1671.1-2017 (Revision of IEEE Std 1671.1-2009)
- There are 2 normtitles, "IEEE Std P1671/D5, June 2006" and "IEEE Std P1671/D5, Jun 2006". They are associated with 2 distinct documents. The only difference is that one has the full month name and the other the short one. I think we need to distinguish them in another way.
- And there are 2 more, "IEEE Std PC37.100.1/D8, Dec 2006" and "IEEE Std PC37.100.1/D8, Dec2006". The only difference is the space between month and year.
- The "ANSI/IEEE Std" identifier doesn't have a number. How should we handle it?
- Should "EEE Std 488.2-1992" be "IEEE Std 488.2-1992"?
- Should the year in "IEEE 1076-CONC-I99O" be 1990, not I99O?
- These 2 normtitles look identical:
EEE Std 1671.1-2017 (Revision of IEEE Std 1671.1‐2009)
EEE Std 1671.1-2017 (Revision of IEEE Std 1671.1-2009)
These seem to be two identical documents, and notice that they have a typo: "EEE" should have been "IEEE". Let's correct them as special cases.
- There are 2 normtitles, "IEEE Std P1671/D5, June 2006" and "IEEE Std P1671/D5, Jun 2006". They are associated with 2 distinct documents. The only difference is that one has the full month name and the other the short one. I think we need to distinguish them in another way.
The first is 04067148.xml, the second is 041524522.xml.
The first one is broken because it has this:
<standard_id>0</standard_id>
The second one has this:
<standard_id>3721</standard_id>
I think we want to drop the first one.
- And there are 2 more, "IEEE Std PC37.100.1/D8, Dec 2006" and "IEEE Std PC37.100.1/D8, Dec2006". The only difference is the space between month and year.
Similarly, the first one is 04141261.xml, the second is 04152567.xml.
The first one is broken because it has this:
<standard_id>0</standard_id>
The second one has this:
<standard_id>4169</standard_id>
Let's also get rid of the first one. We should drop all items with <standard_id> value == 0 (and please document this).
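The drop rule could be sketched roughly as below. This is only an illustration in Python, not the project's actual code: the function name `should_drop` and the `<publication>` wrapper element in the usage note are hypothetical, and the only thing taken from the files above is the `<standard_id>` element and its value.

```python
import xml.etree.ElementTree as ET


def should_drop(xml_text: str) -> bool:
    """Return True when the document carries <standard_id>0</standard_id>.

    Assumption: a document with no <standard_id> element at all is kept,
    because the rule discussed here only covers an explicit value of 0.
    """
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so this works wherever
    # <standard_id> sits in the tree.
    node = next(root.iter("standard_id"), None)
    if node is None or node.text is None:
        return False
    return node.text.strip() == "0"
```

For example, `should_drop("<publication><standard_id>0</standard_id></publication>")` flags the broken record, while the same wrapper with `3721` or `4169` is kept.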
- The "ANSI/IEEE Std" identifier doesn't have a number. How should we handle it?
Which one? Can you be more specific? Many documents start with "ANSI/IEEE Std".
- Should "EEE Std 488.2-1992" be "IEEE Std 488.2-1992"?
Yes. We should replace all /^EEE\s/ with IEEE\
- Should the year in "IEEE 1076-CONC-I99O" be 1990, not I99O?
Yes, this is clearly a typo. We should fix this in a "cleaning" step.
Given the data errors, I think we should first run a separate "cleaning stage" and then parse the identifiers.
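A cleaning stage covering the two title fixes agreed above might look something like this. It is a minimal sketch under assumptions: the name `clean_normtitle` is made up, the replacement for `/^EEE\s/` is assumed to be `IEEE` followed by a single space, and the year repair only touches a final 4-character token after a hyphen, which is a guess based on the one example in this thread.

```python
import re

# A trailing year-like token after a hyphen, possibly with the letters
# I and O typed in place of the digits 1 and 0 (e.g. "I99O").
YEAR_LIKE = re.compile(r"(?<=-)[0-9IO]{4}$")
LETTER_TO_DIGIT = str.maketrans("IO", "10")


def clean_normtitle(title: str) -> str:
    """Apply the known typo fixes before the identifier is parsed."""
    title = title.strip()
    # "EEE Std ..." at the start of a title is a truncated "IEEE Std ...".
    title = re.sub(r"^EEE\s", "IEEE ", title)
    # Repair years such as "I99O" -> "1990"; real digits are left as they are.
    return YEAR_LIKE.sub(lambda m: m.group(0).translate(LETTER_TO_DIGIT), title)
```

Running the parser only on the output of `clean_normtitle` keeps the parsing regex free of special cases for these data errors.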
- The "ANSI/IEEE Std" identifier doesn't have a number. How should we handle it?
Which one? Can you be more specific? Many documents start with "ANSI/IEEE Std".
Yes, many start with it, but only one ends with it:
ANSI/IEEE PC63.7/D rev17, December 2014
ANSI/IEEE STD 185-1975 (Revision of IEEE Std 165-1947)
ANSI/IEEE Std
ANSI/IEEE Std 1-1986
ANSI/IEEE Std 1000-1987
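One way to spot such entries in the full list is to flag every normtitle that contains no digit at all. This is just a sketch: `titles_without_number` is a hypothetical helper, and "no digit anywhere" is an assumed stand-in for "the identifier has no number".

```python
import re


def titles_without_number(titles):
    """Return the normtitles that contain no digit, in their original order."""
    return [t for t in titles if not re.search(r"\d", t)]


sample = [
    "ANSI/IEEE PC63.7/D rev17, December 2014",
    "ANSI/IEEE STD 185-1975 (Revision of IEEE Std 165-1947)",
    "ANSI/IEEE Std",
    "ANSI/IEEE Std 1-1986",
    "ANSI/IEEE Std 1000-1987",
]
flagged = titles_without_number(sample)
```

On the five titles above, only the bare "ANSI/IEEE Std" entry is flagged.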
Done for the moment. Going to create a new task.
Full list. A lot cleaner than stdnumber.
UPDATED: pubid-sorted.txt.zip