ietf-tools / relaton-data-ieee


Weird title “Redline” #3

Open strogonoff opened 2 years ago

strogonoff commented 2 years ago

https://github.com/ietf-ribose/relaton-data-ieee/blob/main/data/IEC_62531.2012_REDLINE.yaml#L9

I don’t suppose that title should be there, or it’s not consistent with other uses of title-main.

ronaldtse commented 2 years ago

@strogonoff REDLINE is a type of document that shows a diff between two versions. It is a legitimate document type separate from a standard.

In the case of IEEE, we are not yet able to fully parse all the titles because they are so varied in this dataset.

For example, this is a document:

There are too many combinations to handle for now.

ronaldtse commented 2 years ago

And yes, the title content is not entirely correct here:

https://github.com/ietf-ribose/relaton-data-ieee/blob/73318ae8d646de3d3a8ef76582397a6b53200584/data/IEC_62531.2012_REDLINE.yaml#L3-L14

title-main is not supposed to be Redline. The Redline part should go together with the content of title-intro, and title-intro shouldn't be that either. This is a parsing problem: the split at the en-dash divided the title into intro and main parts.
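To make the failure mode concrete, here is a minimal sketch of how a dash-based split can produce a title-main of just "Redline". The function `naive_split` and its regex are hypothetical stand-ins, not the actual relaton-ieee parser, and the raw title string is taken from the `main` title quoted later in this thread.

```python
import re

# Hypothetical stand-in for the real parser (NOT relaton-ieee code):
# split a raw title at spaced dashes (hyphen or en-dash) and treat the
# last piece as the main title, everything before it as the intro.
_DASH_SPLIT = re.compile(r"\s+[-\u2013]\s+")


def naive_split(raw_title):
    parts = _DASH_SPLIT.split(raw_title)
    if len(parts) < 2:
        # No spaced dash: the whole string is the main title.
        return {"title-main": raw_title}
    return {
        "title-intro": " - ".join(parts[:-1]),
        "title-main": parts[-1],
    }


raw = ("IEC 62531:2012(E) (IEEE Std 1850-2010): Standard for Property "
       "Specification Language (PSL) - Redline")
result = naive_split(raw)
print(result["title-main"])   # Redline
print(result["title-intro"])  # the identifiers plus the real title
```

The unspaced hyphen in "1850-2010" is left alone because the pattern requires whitespace on both sides, so the only split point is the one before "Redline".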

It's difficult to fix this dataset. The original data source from IEEE is too convoluted -- too many inconsistencies, not normalised, widely varying patterns.

ronaldtse commented 2 years ago

It probably should be something like this:

title:
- type: title-main
  content: 'Standard for Property Specification Language (PSL)'
  format: text/plain
- type: main
  content: 'IEC 62531:2012(E) (IEEE Std 1850-2010): Standard for Property Specification
    Language (PSL) - Redline'
  format: text/plain
type: standard-redline
docid:
- id: "IEC 62531:2012(E) Redline"
  type: IEC
- id: 978-0-7381-8094-6
  type: ISBN
docnumber: IEC 62531.2012 Redline

See what I mean? Too convoluted...

strogonoff commented 2 years ago

I see. This will be visible in the IETF BibXML service we deliver. Maybe more Ruby resources are needed so we can fix data processing by NY? Someone could help write tests perhaps, to take care of these edge cases specifically.
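As one possible shape for such a test, the sketch below flags records whose title-main consists only of a marker word such as "Redline". It assumes records have already been loaded into plain dictionaries shaped like the YAML above; the names `SUSPICIOUS_MAIN_TITLES` and `suspicious_titles` are hypothetical, and the marker list is an assumption that would need extending as other cases surface.

```python
# Hypothetical lint, not part of relaton-ieee: flag title entries whose
# title-main content is only an edition marker rather than a real title.
SUSPICIOUS_MAIN_TITLES = {"redline"}  # assumption: extend as needed


def suspicious_titles(record):
    """Return title entries of type title-main that contain only a marker word."""
    return [
        t for t in record.get("title", [])
        if t.get("type") == "title-main"
        and t.get("content", "").strip().lower() in SUSPICIOUS_MAIN_TITLES
    ]


bad = {"title": [
    {"type": "title-main", "content": "Redline", "format": "text/plain"},
]}
good = {"title": [
    {"type": "title-main",
     "content": "Standard for Property Specification Language (PSL)",
     "format": "text/plain"},
]}

print(len(suspicious_titles(bad)))   # 1
print(len(suspicious_titles(good)))  # 0
```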

On 27 Nov 2021, at 2:22 AM, Ronald Tse @.***> wrote:

It probably should be something like this:

title:
- type: title-main
  content: 'Standard for Property Specification Language (PSL)'
  format: text/plain
- type: main
  content: 'IEC 62531:2012(E) (IEEE Std 1850-2010): Standard for Property Specification
    Language (PSL) - Redline'
  format: text/plain
type: standard-redline
docid:
- id: "IEC 62531:2012(E) Redline"
  type: IEC
- id: 978-0-7381-8094-6
  type: ISBN
docnumber: IEC 62531.2012 Redline

See what I mean? Too convoluted...


ronaldtse commented 2 years ago

There are still over 900 edge/special cases in this data set to be fixed right now, out of nearly 10,000. It will be a continual data cleaning exercise -- we must prioritize the other work over this right now.

The baseline is: our Relaton data set is already miles better than the official IEEE dataset on their own website.

We have already invested undue amounts of resourcing to clean this particular dataset. Any further work should be officially done with IEEE to fix the source. It is rather pointless to fix this data set that might get further broken at the source.

ronaldtse commented 2 years ago

If you check the Relaton-IEEE code, you will see how many patterns we already handle. This dataset is way more inconsistent than the sources from the 180,000 entries of the IEC Electropedia (FYI @andrew2net ).

andrew2net commented 2 years ago

It probably should be something like this:

title:
- type: title-main
  content: 'Standard for Property Specification Language (PSL)'
  format: text/plain
- type: main
  content: 'IEC 62531:2012(E) (IEEE Std 1850-2010): Standard for Property Specification
    Language (PSL) - Redline'
  format: text/plain
type: standard-redline
docid:
- id: "IEC 62531:2012(E) Redline"
  type: IEC
- id: 978-0-7381-8094-6
  type: ISBN
docnumber: IEC 62531.2012 Redline

See what I mean? Too convoluted...

Removing "- Redline" is easy, but the identifiers at the beginning are not so easy to remove, as they may have many variants. I don't have much time to check it now. I've updated the title in relaton-ieee v1.9.4:

 title:
 - type: title-main
   content: 'IEC 62531:2012(E) (IEEE Std 1850-2010): Standard for Property Specification Language (PSL)'
   format: text/plain
 - type: main
   content: 'IEC 62531:2012(E) (IEEE Std 1850-2010): Standard for Property Specification
     Language (PSL) - Redline'
   format: text/plain
 type: standard-redline
 docid:
 - id: "IEC 62531:2012(E) Redline"
   type: IEC
 - id: 978-0-7381-8094-6
   type: ISBN
 docnumber: IEC 62531.2012 Redline
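The difference between the easy part and the hard part can be sketched as follows. This is an illustration only, not the relaton-ieee implementation: `strip_redline` handles the trailing "- Redline", while `strip_known_prefix` is a hypothetical pattern that matches just the one identifier shape seen in this issue and would miss the other variants.

```python
import re

# Easy part: drop a trailing "- Redline" (hyphen or en-dash).
_REDLINE_SUFFIX = re.compile(r"\s*[-\u2013]\s*Redline\s*$", re.IGNORECASE)

# Hard part: leading identifiers. ASSUMPTION: this pattern covers only the
# single shape "IEC <num>:<year>(E) (IEEE Std <num>-<year>): " from this
# issue; real data has many more variants.
_ID_PREFIX = re.compile(r"^IEC \d+:\d{4}\(E\) \(IEEE Std [\d.]+-\d{4}\):\s*")


def strip_redline(title):
    return _REDLINE_SUFFIX.sub("", title)


def strip_known_prefix(title):
    return _ID_PREFIX.sub("", title)


full = ("IEC 62531:2012(E) (IEEE Std 1850-2010): Standard for Property "
        "Specification Language (PSL) - Redline")
without_suffix = strip_redline(full)
without_prefix = strip_known_prefix(without_suffix)
print(without_suffix)
print(without_prefix)  # Standard for Property Specification Language (PSL)
```

Here `without_suffix` corresponds to the title-main shown above, and `without_prefix` to the title-main proposed earlier in the thread.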