relaton / relaton-3gpp

MIT License

Implement Relaton 3GPP #1

Closed ronaldtse closed 2 years ago

ronaldtse commented 2 years ago

Data: http://xml2rfc.tools.ietf.org/public/rfc/bibxml-3gpp-new (there is a bulk download)

Data instance sample: http://xml2rfc.tools.ietf.org/public/rfc/bibxml-3gpp-new/reference.3GPP.55.236.xml

We need to develop a Relaton model, and then import the data.

ronaldtse commented 2 years ago

The model in LutaML UML:

(TODO: Add this to relaton-models)

// "Specs_GSM+3G" table
class Spec {
  type: SpecificationType
  specNumber: String
  published: Boolean
  title: String
  workingGroupformer: String
  workingGroupPrime: String
  workingGroupOther: String
  rapporteurId: Number
  remarks: String
  radioTechnology: RadioTechnology
  isCommonImsSpec: Boolean
  isInternal: Boolean
  withdrawn: Boolean
  creationDate: DateTime
  updateDate: DateTime
  titleVerifiedDate: DateTime
  url: URL
}

// "Releases" table
class Release {
  code: String
  description: String
  shortDescription: String
  version2G: String
  version3G: String
  isDefunct: Boolean
  remarks: String
  wpmCode2G: String
  wpmCode3G: String
  freezeMeeting: String
  freezeStage1Meeting: String
  freezeStage2Meeting: String
  freezeStage3Meeting: String
  closeMeeting: String
  projectStart: DateTime
  projectEnd: DateTime
  previousRelease: Release  
}

// "Specs_GSM+3G_release-info" table
class SpecRelease {
  specId: Spec
  releaseId: Release
  specReleaseId: Number
  remarks: String
  withdrawn: Boolean
  creationDate: DateTime
  updateDate: DateTime
  freezeMeeting: String
  stoppedAtMeeting: String
}

enum SpecificationType {
  TR {
    definition {
      Technical Report
    }
  }

  TS {
    definition {
      Technical Specification
    }
  }
}

enum RadioTechnology {
  2G
  3G
  LTE
  5G
}
andrew2net commented 2 years ago

@ronaldtse

Data instance sample: http://xml2rfc.tools.ietf.org/public/rfc/bibxml-3gpp-new/reference.3GPP.55.236.xml

The sample is:

<reference anchor="3GPP.55.236">
  <front>
    <title>
      Specification of A8_V MILENAGE Algorithm: An example algorithm for the key generation function A8_V
    </title>
    <author>
      <organization>3GPP</organization>
    </author>
    <date year="2012" month="September" day="27"/>
  </front>
  <seriesInfo name="3GPP TS" value="55.236 11.0.0"/>
  <format type="HTML" target="http://www.3gpp.org/ftp/Specs/html-info/55236.htm"/>
</reference>
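As a rough illustration of what an import has to pull out of one of these records, here is a minimal sketch that parses the sample with REXML from the Ruby standard distribution. The helper name `parse_3gpp_reference` and the keys of the returned Hash are assumptions made for this example; they are not part of relaton-3gpp. The split of the `seriesInfo` value into a spec number and a version is also an assumption based on this one sample.

```ruby
require 'rexml/document'
require 'date'

# The sample reference quoted above, inlined so the sketch runs offline.
bibxml = <<~XML
  <reference anchor="3GPP.55.236">
    <front>
      <title>
        Specification of A8_V MILENAGE Algorithm: An example algorithm for the key generation function A8_V
      </title>
      <author>
        <organization>3GPP</organization>
      </author>
      <date year="2012" month="September" day="27"/>
    </front>
    <seriesInfo name="3GPP TS" value="55.236 11.0.0"/>
    <format type="HTML" target="http://www.3gpp.org/ftp/Specs/html-info/55236.htm"/>
  </reference>
XML

# Hypothetical helper: returns a plain Hash, not a Relaton object.
def parse_3gpp_reference(xml)
  ref = REXML::Document.new(xml).root
  date = ref.elements['front/date'].attributes
  series = ref.elements['seriesInfo'].attributes
  # Assumption: the value is "<spec number> <version>", as in the sample.
  number, version = series['value'].split(' ', 2)
  {
    anchor: ref.attributes['anchor'],
    title: ref.elements['front/title'].text.strip,
    type: series['name'].split.last, # "3GPP TS" -> "TS"
    number: number,
    version: version,
    date: Date.new(date['year'].to_i, Date::MONTHNAMES.index(date['month']), date['day'].to_i),
    url: ref.elements['format'].attributes['target']
  }
end

item = parse_3gpp_reference(bibxml)
puts item[:number]  # spec number taken from seriesInfo
puts item[:version] # version taken from seriesInfo
```

The record carries no release, working group or radio technology information, which is why the discussion below turns to the 3GPP database.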
ronaldtse commented 2 years ago
andrew2net commented 2 years ago

The model in LutaML UML:

(TODO: Add this to relaton-models)

@ronaldtse I'll implement these classes in the relaton-3gpp gem but we need a grammar to generate RelatonXML files. Ping @opoudjis

UPD I think we don't need SpecRelease class. The "Specs_GSM+3G_release-info" table is needed for many-to-many relations. We only need to know which Releases are related to a Spec, so the SpecRelease attributes could be moved to the Release class. I suppose the Spec class should inherit from RelatonBib::BibliographicItem, so we can reuse its attributes:

ronaldtse commented 2 years ago

UPD I think we don't need SpecRelease class

Good insight, this may indeed be a joining table. I agree that a Spec and the Release classes should both be citable RelatonBib::BibliographicItem items.

The only attribute I can't find is stoppedAtMeeting. Where does this go?

andrew2net commented 2 years ago

The only attribute I can't find is stoppedAtMeeting. Where does this go?

@ronaldtse it seems we need to move remarks, withdrawn, freezeMeeting, and stoppedAtMeeting to the Release, and add release: Release[0..*] to the Spec.

opoudjis commented 2 years ago

The overlap between Spec and BibItem is of course unacceptable, and Andrej was right to point it out. If the Spec and the Release are both citable BibItems, they are related through a relation, of type derivedFrom (I think; if not, complements).

That means that we need to indicate whether a bibitem is a spec or a release; I'm introducing those as docsubtypes.

Andrej, I'm trying to follow in this what you've already modelled.

... I continue to be annoyed at how poor a match these Lutaml classes are for Bibitem. This is a bunch of randomness.

class Spec {
  type: SpecificationType => bibitem/item/doctype
  specNumber: String => bibitem/docnumber
  published: Boolean => bibitem/status/stage = 'published'
  title: String => bibitem/title
  workingGroupformer: String => bibitem/ext/editorialgroup/technical-committee[@type = 'former']
  workingGroupPrime: String => bibitem/ext/editorialgroup/technical-committee[@type = 'prime']
  workingGroupOther: String => bibitem/ext/editorialgroup/technical-committee[@type = 'other']
  rapporteurId: Number => bibitem/docidentifier[@type = 'rapporteurId']
  remarks: String => bibitem/note
  withdrawn: Boolean => bibitem/status/stage = 'withdrawn'
  creationDate: DateTime => bibitem/date[@type = 'created'] ; if the date is unrelated to the document, as Andrej believes, it should not be recorded at all
  updateDate: DateTime => bibitem/date[@type = 'updated'] ; if the date is unrelated to the document, as Andrej believes, it should not be recorded at all
  titleVerifiedDate: DateTime => bibitem/date[@type = 'confirmed'] ; that is my best guess anyway
  url: URL => bibitem/uri

  radioTechnology: RadioTechnology : goes to ext
  isCommonImsSpec: Boolean : goes to ext
  isInternal: Boolean : goes to ext
}

The following are contained within bibitem/relation[@type = 'derivedFrom']/bibitem/ext/release

// "Releases" table
class Release {
  code: String => bibitem/relation[@type = 'derivedFrom']/bibitem/docidentifier
  remarks: String => bibitem/relation[@type = 'derivedFrom']/bibitem/note
  description: String  => bibitem/relation[@type = 'derivedFrom']/bibitem/abstract
  shortDescription: String => bibitem/relation[@type = 'derivedFrom']/bibitem/note[@type = 'shortDescription']
  previousRelease: Release  => bibitem/relation[@type = 'derivedFrom']/bibitem/relation[@type = 'successorOf']/bibitem

  version2G: String
  version3G: String
  isDefunct: Boolean
  wpmCode2G: String
  wpmCode3G: String
  freezeMeeting: String
  freezeStage1Meeting: String
  freezeStage2Meeting: String
  freezeStage3Meeting: String
  closeMeeting: String
  projectStart: DateTime
  projectEnd: DateTime
}

and the grammar covering this in extensions:

include "isodoc.rnc" {

start = iso-standard

DocumentType = "TR" | "TS"

DocumentSubtype = "spec" | "release"

BibDataExtensionType =
        doctype, docsubtype?, editorialgroup, ics*, radiotechnology?, common-ims-spec?, internal?, release?

}

RadioTechnologyType = "2G" | "3G" | "LTE" | "5G"

radiotechnology = element radiotechnology { RadioTechnologyType }

common-ims-spec = element common-ims-spec { xsd:boolean }

internal = element internal { xsd:boolean }

release = element release {
  element version2G { text },
  element version3G { text },
  element defunct { xsd:boolean },
  element wpm-code-2G { text },
  element wpm-code-3G { text },
  element freeze-meeting { text },
  element freeze-stage1-meeting { text },
  element freeze-stage2-meeting { text },
  element freeze-stage3-meeting { text },
  element close-meeting { text },
  element project-start { xsd:date },
  element project-end { xsd:date  }
}
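To make the shape of the `release` extension concrete, here is a minimal sketch that builds the element with REXML in the order the grammar lists its children. It assumes the two `wpm-code` children are meant to be `wpm-code-2G` and `wpm-code-3G`, mirroring `wpmCode2G` and `wpmCode3G` in the model. The helper `release_element` is hypothetical, and the sample values are illustrative only; in particular the 3G code and both dates are made up.

```ruby
require 'rexml/document'

# Child element names, in grammar order (wpm-code-2G is an assumption, see above).
release_elements = %w[
  version2G version3G defunct wpm-code-2G wpm-code-3G
  freeze-meeting freeze-stage1-meeting freeze-stage2-meeting
  freeze-stage3-meeting close-meeting project-start project-end
]

# Hypothetical helper: every child is required, so a missing key raises KeyError.
def release_element(names, values)
  release = REXML::Element.new('release')
  names.each do |name|
    release.add_element(name).text = values.fetch(name).to_s
  end
  release
end

sample = {
  'version2G' => '3', 'version3G' => '-', 'defunct' => true,
  'wpm-code-2G' => 'GSM_PH1', 'wpm-code-3G' => 'EXAMPLE_3G', # 3G code is made up
  'freeze-meeting' => 'gsm-25b', 'freeze-stage1-meeting' => 'GSM-25b',
  'freeze-stage2-meeting' => 'GSM-25b', 'freeze-stage3-meeting' => 'GSM-25b',
  'close-meeting' => 'SMG-17',
  'project-start' => '1990-01-01', 'project-end' => '1992-01-01' # made-up dates
}

release = release_element(release_elements, sample)
puts release
```

Because the grammar declares none of the children optional, a row with empty project dates could not be serialised as-is; whether those children should become optional is a modelling decision this sketch does not make.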

I have already added this to metanorma-model-iso.

ronaldtse commented 2 years ago

I did some detailed investigation in the database.

The remarks column in SpecReleases contains very specific information to a Spec within a Release. i.e. SpecRelease is not a joining table, it contains unique information that applies to a Spec in a Release.

e.g.

Screenshot 2021-12-02 at 6 36 57 PM

This also means that a Spec at a different Release is actually a different document. e.g. the Spec 45.902 in Rel-6 and Rel-7 are differently versioned documents.

This means the model works like this instead:

Thus, SpecRelease actually represents a document! Let's call it a SpecDocument.

A "Spec" is pretty much like a SpecProject that produces multiple SpecDocuments.

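I would suggest we make these changes:

  • A SpecDocument links to a Release (e.g. "Rel-9") and a Spec (e.g. "TS 45.902"). We can cite it as "3GPP TS 45.902:Rel-9"
  • A Spec (e.g. "TS 45.902") can contain multiple SpecDocuments (e.g. "TS 45.902:Rel-7", "TS 45.902:Rel-8", "TS 45.902:Rel-9")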

I'm still not fully sure where the "Version" concept can be extracted from. You can see that on this page you can have multiple document versions within one Release: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1080

Screenshot 2021-12-02 at 9 51 13 PM
andrew2net commented 2 years ago

The remarks column in SpecReleases contains very specific information to a Spec within a Release. i.e. SpecRelease is not a joining table, it contains unique information that applies to a Spec in a Release.

SpecReleases contains information, but it is also a joining table: one Spec can have multiple Releases and one Release can have multiple Specs, and in practice they do. 2331 of 3918 Specs have more than one Release relation, and 35 of 37 Releases have more than one Spec relation:

database['Specs_GSM+3G'].select { |s| database['Specs_GSM+3G_release-info'].select { |sr| s[:Number] == sr[:Spec] }.size > 1 }.size
=> 2331

database['Releases'].select { |r| database['Specs_GSM+3G_release-info'].select { |sr| sr[:Release] == r[:Release_code] }.size > 1 }.size
=> 35

The data is compact as a normalized relational DB, but if we denormalize it we will get a tremendous number of Spec -> Release combinations. In the DB there are 3918 Specs. If we create a document from each Spec -> Release relation we will have 3918 + 17217 = 21135 documents:

database['Specs_GSM+3G'].map { |s| database['Specs_GSM+3G_release-info'].select { |r| r[:Spec] == s[:Number] }.size }.sum
=> 17217
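The counts above rescan the join table once per Spec or Release. The same numbers can be obtained in a single pass with `Enumerable#tally` (Ruby 2.7+). The sketch below uses a tiny invented data set in place of the real `database` tables, so the figures it prints are not the ones quoted in this comment; only the column names `:Number`, `:Spec` and `:Release` come from the queries above.

```ruby
# Toy stand-ins for database['Specs_GSM+3G'] and
# database['Specs_GSM+3G_release-info']; the rows are invented.
specs = [{ Number: '45.902' }, { Number: '55.236' }, { Number: '30.531' }]
spec_releases = [
  { Spec: '45.902', Release: 'Rel-6' },
  { Spec: '45.902', Release: 'Rel-7' },
  { Spec: '45.902', Release: 'Rel-8' },
  { Spec: '55.236', Release: 'Rel-11' },
  { Spec: '30.531', Release: 'Rel-4' }
]

# One pass over the join table: how many rows reference each Spec.
releases_per_spec = spec_releases.map { |sr| sr[:Spec] }.tally

# Specs with more than one Release relation.
multi_release_specs = specs.count { |s| releases_per_spec.fetch(s[:Number], 0) > 1 }

# Number of Spec -> Release pairs, then Specs plus pairs.
pair_count = specs.sum { |s| releases_per_spec.fetch(s[:Number], 0) }
total_documents = specs.size + pair_count

puts multi_release_specs
puts pair_count
puts total_documents
```

With the real tables, `pair_count` corresponds to the 17217 figure and `total_documents` to 3918 + 17217.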

I would suggest we make these changes:

  • A SpecDocument links to a Release (e.g. "Rel-9") and a Spec (e.g. "TS 45.902"). We can cite it as "3GPP TS 45.902:Rel-9"
  • A Spec (e.g. "TS 45.902") can contain multiple SpecDocuments (e.g. "TS 45.902:Rel-7", "TS 45.902:Rel-8", "TS 45.902:Rel-9")

@opoudjis proposes to use bibitem/relation for Releases. One Spec (bibitem) can have multiple Releases (bibitem/relations). Each relation has a description that can be used for remarks. So in this case we can use a reference like "TS 45.902" to get the Spec with all its Releases. If we need the Spec with a single Release, cited as "TS 45.902:Rel-7", we can get "TS 45.902" and remove all Releases except "Rel-7". What do you say, Ronald?
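The lookup described in this comment can be sketched as follows. It is only an illustration of the proposal: the `fetch_spec` helper, the Hash layout with a `:relations` list, and the reference pattern are all assumptions for this example, and the index holds a single invented entry.

```ruby
# Invented in-memory index: one Spec with its Release relations.
index = {
  'TS 45.902' => {
    type: 'TS', number: '45.902',
    relations: [{ code: 'Rel-7' }, { code: 'Rel-8' }, { code: 'Rel-9' }]
  }
}

# Hypothetical helper. Accepts "TS 45.902" or "TS 45.902:Rel-7", with an
# optional "3GPP " prefix. Returns nil for unknown or malformed references.
def fetch_spec(reference, index)
  m = reference.match(/\A(?:3GPP\s+)?(?<type>TR|TS)\s+(?<number>[^:\s]+)(?::(?<release>.+))?\z/)
  return nil unless m

  spec = index["#{m[:type]} #{m[:number]}"]
  return spec if spec.nil? || m[:release].nil?

  # Keep only the requested Release; the indexed entry itself is not modified.
  spec.merge(relations: spec[:relations].select { |rel| rel[:code] == m[:release] })
end

all_releases = fetch_spec('TS 45.902', index)
one_release = fetch_spec('3GPP TS 45.902:Rel-7', index)
puts all_releases[:relations].size
puts one_release[:relations].map { |rel| rel[:code] }.inspect
```

Filtering on a copy keeps the full Spec available for later lookups of other releases.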

I'm still not fully sure where the "Version" concept can be extracted from. You can see that on this page you can have multiple document versions within one Release:

There are tables with the word "version" in their names:

"2003-03-04_work-plan_web-export-version",
 "2001-10-04_version-value-to-character-map",
 "2003-01-22_latest-version-ETSI-published_step5_table",
 "2003-01-22_latest-version-ETSI-published_step5-1-R99-table",
 "2003-01-22_latest-version-ETSI-published_step5-1-Rel4_table",
 "2003-01-22_latest-version-ETSI-published_step5-1-Rel5-table",
 "2003-01-22_latest-version-ETSI-published_step5-1-Rel6-table",
 "2003-02-06_table03-versions_table",
 "2003-04-10_webexp04_release-and-version-details_table",
 "2003-04-10_webexp12_latest-versions-all-releases_table"

I'll investigate them tomorrow.

ronaldtse commented 2 years ago

In the DB there are 3918 Specs. If we create a document from each Spec -> Release relation we will have 3918 + 17217 = 21135 documents

This is actually correct. You can see this being shown here: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1080

Screenshot 2021-12-03 at 12 02 40 PM

One Spec (bibitem) can have multiple Releases (bibitem/relations).

This is an inaccurate simplification.

Each Spec-Release document is a separate document. Each document has a version number. At every new release, there is a new version number even though the document content is identical:

Screenshot 2021-12-03 at 12 04 13 PM

I was able to find the versions, and the way to generate the URL, will post later.

ronaldtse commented 2 years ago

I finally found it!

SpecDocument information

SpecDocument is actually defined in the table called 2001-04-25_schedule. (I know, funny name.)

Screenshot 2021-12-03 at 4 14 40 PM

These are the attributes we want from this table:

Here's a direct comparison of database rows against the web content:

Screenshot 2021-12-03 at 4 28 30 PM Screenshot 2021-12-03 at 4 28 03 PM

SpecDocument remark and withdrawn status

The Specs_GSM+3G_release-info table is what provides the "remarks" and "withdrawn" information in the web display:

Screenshot 2021-12-03 at 4 52 15 PM

For future consideration: related work items

The table 2008-03-08_Specs-vs-WIs is a joining table between Specs and Work Items (WIs).

image

For future consideration: committees

The table committees_local-names is a list of all committees:

Screenshot 2021-12-03 at 4 51 01 PM
andrew2net commented 2 years ago
  • A Spec (e.g. "TS 45.902") can contain multiple SpecDocuments (e.g. "TS 45.902:Rel-7", "TS 45.902:Rel-8", "TS 45.902:Rel-9")

@ronaldtse The relaton-* flavor gems don't handle collections of documents. relaton-cli does, but not in this way. We can use bibitem/relation the same way as for an all-parts document. Is that OK?

ronaldtse commented 2 years ago

@andrew2net yes, that's what I meant. Thanks for the clarification.

So we have three types of bibliographic items (citable items) in 3GPP:

andrew2net commented 2 years ago

@ronaldtse do we need these attributes in the relaton-3gpp-model?

// "Releases" table
class Release {
  code: String
  description: String
  shortDescription: String
  version2G: String
  version3G: String
  isDefunct: Boolean
  remarks: String
  wpmCode2G: String
  wpmCode3G: String
  freezeMeeting: String
  freezeStage1Meeting: String
  freezeStage2Meeting: String
  freezeStage3Meeting: String
  closeMeeting: String
  projectStart: DateTime
  projectEnd: DateTime
  previousRelease: Release  
}

I didn't manage to find how the "Releases" table is related to Specs. The table has these fields:

database['Releases'][0]
=> {:Release_code=>"Ph1",
 :Release_description=>"Phase 1",
 :Release_short_description=>"Ph1",
 :version_2g=>"3",
 :version_3g=>"-",
 :"sort-order"=>"100",
 :defunct=>"1",
 :remarks=>"Release closed - no CRs permitted.",
 :wpm_code_2g=>"GSM_PH1",
 :wpm_code_3g=>nil,
 :"freeze meeting"=>"gsm-25b",
 :PROJECT_ID=>"705",
 :"rel-proj-start"=>nil,
 :"rel-proj-end"=>nil,
 :ITUR_code=>nil,
 :version_2g_dec=>"3",
 :version_3g_dec=>"-",
 :previousRelease=>nil,
 :Stage1_freeze=>"GSM-25b",
 :Stage2_freeze=>"GSM-25b",
 :Stage3_freeze=>"GSM-25b",
 :Protocols_freeze=>"GSM-25b",
 :Closed=>"SMG-17",
 :Field1=>nil}

The field :release=>"Rel-15" from "2001-04-25_schedule" table doesn't match with any :Release_code or :Release_description in the "Releases" table:

database['Releases'].detect { |r| [:Release_code] == "Rel-15" }
=> nil
database['Releases'].detect { |r| [:Release_description] == "Rel-15" }
=> nil
ronaldtse commented 2 years ago

The field :release=>"Rel-15" from "2001-04-25_schedule" table doesn't match with any :Release_code or :Release_description in the "Releases" table:

It is available though, in Release_code and also Release_short_description:

Screenshot 2021-12-04 at 2 00 04 AM
andrew2net commented 2 years ago

It is available though, in Release_code and also Release_short_description:

@ronaldtse you are right, it's my mistake
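One likely reason the earlier lookup returned nil: the block compared the array literal `[:Release_code]` with the string instead of reading the column from the row with `r[:Release_code]`, and an Array never equals a String. A minimal sketch with two stand-in rows (only the columns needed here; the "Rel-15" row follows the screenshot description above):

```ruby
# Stand-in rows for database['Releases'], reduced to the relevant columns.
releases = [
  { Release_code: 'Ph1', Release_short_description: 'Ph1' },
  { Release_code: 'Rel-15', Release_short_description: 'Rel-15' }
]

# As written earlier: [:Release_code] is a one-element Array, so the
# comparison is false for every row and detect returns nil.
broken = releases.detect { |r| [:Release_code] == 'Rel-15' }

# Reading the column from the row finds the record.
fixed = releases.detect { |r| r[:Release_code] == 'Rel-15' }

puts broken.inspect
puts fixed.inspect
```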

andrew2net commented 2 years ago

@ronaldtse not every row in the "2001-04-25_schedule" table has related rows in the "Specs_GSM+3G" and "Specs_GSM+3G_release-info" tables. Here are missing spec numbers:

schedule.select { |sc| spec.detect { |s| s[:Number] == sc[:spec] }.nil? }.map { |sc| sc[:spec] }.uniq
=> ["00.000",
 "00.001",
 "02.10U",
 "02.23U",
 "02.24U",
 "02.25U",
 "02.30U",
 "02.40U",
 "02.41U",
 "02.50U",
 "02.51U",
 "02.52U",
 "02.53U",
 "02.54U",
 "02.55U",
 "03.20U",
 "03.21U",
 "03.22U",
 "03.23U",
 "04.01U",
 "04.02U",
 "04.03U",
 "04.04U",
 "22.10U",
 "22.20U",
 "25.25U",
 "31.01U",
 "31.02U"]

Without these relations we are missing some critical data; for example, the title is in the spec table only. Shouldn't we skip these documents?

ronaldtse commented 2 years ago

@andrew2net yes, let's skip this list of spec numbers. From a Google search, they don't seem to exist.
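Skipping those rows could look roughly like this: build the set of known spec numbers once and partition the schedule against it. The rows are invented stand-ins; only the column names `:Number` and `:spec` and the spec numbers themselves come from the query above.

```ruby
require 'set'

# Invented stand-ins for the "Specs_GSM+3G" and "2001-04-25_schedule" rows.
spec = [{ Number: '55.236' }, { Number: '45.902' }]
schedule = [
  { spec: '55.236', release: 'Rel-11' },
  { spec: '02.10U', release: 'Ph2' }, # release value is made up
  { spec: '45.902', release: 'Rel-7' },
  { spec: '00.000', release: 'Ph1' }  # release value is made up
]

# A Set gives constant-time membership checks instead of a scan per row.
known_numbers = spec.map { |s| s[:Number] }.to_set

# kept: rows with a matching Spec; skipped: rows to leave out of the import.
kept, skipped = schedule.partition { |sc| known_numbers.include?(sc[:spec]) }

puts kept.map { |sc| sc[:spec] }.inspect
puts skipped.map { |sc| sc[:spec] }.uniq.inspect
```

Keeping the skipped rows in a separate list makes it easy to log them rather than dropping them silently.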

andrew2net commented 2 years ago

@ronaldtse

ronaldtse commented 2 years ago

Documents without URLs are fine.

For the duplicates, let me investigate the data and get back. For now just take the newest record.
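A minimal sketch of "take the newest record" for rows that share spec, release and version. The three version columns are the ones used in the queries later in this thread; the `:uploaded` key is a hypothetical name for the upload date column, and the row without a date is an assumption.

```ruby
# Invented rows for one duplicated version; :uploaded is a hypothetical key.
schedule = [
  { spec: '03.20ext', release: 'Ph1-EXT', MAJOR_VERSION_NB: '3',
    TECHNICAL_VERSION_NB: '0', EDITORIAL_VERSION_NB: '0', uploaded: '1995-01-01' },
  { spec: '03.20ext', release: 'Ph1-EXT', MAJOR_VERSION_NB: '3',
    TECHNICAL_VERSION_NB: '0', EDITORIAL_VERSION_NB: '0', uploaded: '2003-09-01' },
  { spec: '03.20ext', release: 'Ph1-EXT', MAJOR_VERSION_NB: '3',
    TECHNICAL_VERSION_NB: '0', EDITORIAL_VERSION_NB: '0', uploaded: nil }
]

# Rows are duplicates when spec, release and the dotted version all match.
def version_key(row)
  version = row.values_at(:MAJOR_VERSION_NB, :TECHNICAL_VERSION_NB, :EDITORIAL_VERSION_NB).join('.')
  [row[:spec], row[:release], version]
end

# ISO dates sort correctly as strings; a missing date sorts first.
newest = schedule
         .group_by { |row| version_key(row) }
         .map { |_key, rows| rows.max_by { |row| row[:uploaded].to_s } }

puts newest.size
puts newest.first[:uploaded]
```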

andrew2net commented 2 years ago

@ronaldtse the "Specs_GSM+3G_release-info" table doesn't have some spec+release combination:

specrel.detect { |sr| sr[:Spec] == '30.531' && sr[:Release] == 'Rel-5' }
=> nil

ronaldtse commented 2 years ago

  • There are duplicates in the "2001-04-25_schedule" table. For example filter by spec: "03.20ext" release: "Ph1-EXT" returns 3 documents with same version "3.0.0"

In the database:

Screenshot 2021-12-11 at 10 56 59 AM Screenshot 2021-12-11 at 10 57 05 AM

Here's the corresponding 3GPP page:

Screenshot 2021-12-11 at 10 56 22 AM

This means that we should treat every "Upload date" as unique.

Moreover when I click on "version details":

Upload date 2003-09-01:

Screenshot 2021-12-11 at 11 08 32 AM

Upload date 1995-01-01:

Screenshot 2021-12-11 at 11 08 45 AM

Some explanation:

Originally I thought we should set "upload date per version" as the uniqueness criteria. However, I found this entry of spec "00.02u": it contains no dates, no location.

Screenshot 2021-12-11 at 12 01 49 PM Screenshot 2021-12-11 at 12 01 58 PM

Then I realized that the missing rows seem to have a missing attribute 3guId: it is set to NULL.

This is validated by:

Conclusion:

  • In 2001-04-25_schedule: Drop all entries with 3guId == NULL.
  • All other entries are valid

ronaldtse commented 2 years ago

@ronaldtse the "Specs_GSM+3G_release-info" table doesn't have some spec+release combination:

Because this combination does not exist:

Look at the 3GPP page:

Screenshot 2021-12-11 at 12 12 20 PM

There is no Rel-5 (Release 5) for this spec.

andrew2net commented 2 years ago

Conclusion:

  • In 2001-04-25_schedule: Drop all entries with 3guId == NULL.
  • All other entries are valid

@ronaldtse not all. For example for TR 02.08:Ph1/0.2.1 there are rows with 3guId 639 and 5227; for TS 03.20ext:Ph1-EXT/3.0.0 there are 3 rows with 3guId 2885, 16103, and nil. Maybe we need to add the 3guId to reference?
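The point can be checked with a few lines: dropping the rows whose 3guId is nil still leaves more than one row per spec, release and version. The rows below are stand-ins built from the 3guId values quoted in this comment; the single `:version` key is a simplification for the example (the real table splits the version across three columns), and storing 3guId as a string is an assumption.

```ruby
# Stand-in rows using the 3guId values quoted above.
schedule = [
  { spec: '03.20ext', release: 'Ph1-EXT', version: '3.0.0', "3guId": '2885' },
  { spec: '03.20ext', release: 'Ph1-EXT', version: '3.0.0', "3guId": '16103' },
  { spec: '03.20ext', release: 'Ph1-EXT', version: '3.0.0', "3guId": nil },
  { spec: '02.08', release: 'Ph1', version: '0.2.1', "3guId": '639' },
  { spec: '02.08', release: 'Ph1', version: '0.2.1', "3guId": '5227' }
]

# Step 1 of the conclusion: drop entries with 3guId == NULL.
valid = schedule.reject { |row| row[:"3guId"].nil? }

# Spec/release/version combinations that still have more than one row.
still_duplicated = valid
                   .group_by { |row| row.values_at(:spec, :release, :version) }
                   .select { |_key, rows| rows.size > 1 }

puts valid.size
still_duplicated.each do |key, rows|
  puts "#{key.join(' / ')}: 3guId #{rows.map { |row| row[:"3guId"] }.join(', ')}"
end
```

Both combinations remain ambiguous after the NULL filter, which is what motivates the question about adding 3guId to the reference.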

ronaldtse commented 2 years ago

@andrew2net can you check the 3GPP portal to see what the results are?

I suspect these entries have different upload dates and are all valid (except for the NULL entry in 3guId column).

andrew2net commented 2 years ago

@ronaldtse it's very strange. In the DB there are two "0.2.1" versions with a non-nil 3guId for spec 02.08 and release Ph1 (Phase 1). For spec 02.08 and release Ph2 (Phase 2) there isn't any "0.2.1" version:

schedule.select { |sc| sc[:spec] == "02.08" && sc[:release] == "Ph1" && sc[:MAJOR_VERSION_NB] == '0' && sc[:TECHNICAL_VERSION_NB] == '2' && sc[:EDITORIAL_VERSION_NB] == '1' }.size
=> 2

schedule.select { |sc| sc[:spec] == "02.08" && sc[:release] == "Ph2" && sc[:MAJOR_VERSION_NB] == '0' && sc[:TECHNICAL_VERSION_NB] == '2' && sc[:EDITORIAL_VERSION_NB] == '1' }.size
=> 0

On the 3GPP portal, one "0.2.1" version is displayed under "Phase 1" and one under "Phase 2": https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=35

image
ronaldtse commented 2 years ago

Some further insights from the screenshot:

As @andrew2net mentioned, the magical thing is both records have a "release" value of Ph1. However in the screenshot these two entries split into "Phase 1" and "Phase 2"! How is that even possible?

Screenshot 2021-12-12 at 2 45 11 PM Screenshot 2021-12-12 at 2 45 31 PM

I found two places that have "02.08" listed as Phase 1 and Phase 2:

  1. In the "Specs_GSM+3G_release-info" table:
     image
  2. In the "temp-status" table:
     Screenshot 2021-12-12 at 2 36 41 PM Screenshot 2021-12-12 at 2 36 58 PM

Now downloading the latest 2021-12-05 DB to see if there are any changes...

ronaldtse commented 2 years ago

I downloaded the latest status_smg_3GPP_2021-12-06_15h15-CET but the data for 02.08 is unchanged, i.e. both records still have the value Ph1.

Notice that the heading for each release is directly from the "Specs_GSM+3G_release-info" table.

Screenshot 2021-12-12 at 2 57 22 PM

So they probably first loaded the Releases, and then loaded the specs per release.

ronaldtse commented 2 years ago

@andrew2net I am not sure if 3GPP has a different database, but in the 2001-04-25_schedule table we can see data from 2021-12-03, which is super recent. This would indicate that at least this table is likely authoritative. So I don't understand how they placed the same record in "Phase 1" vs "Phase 2" that way.

However, I just found this table: 2003-04-10_webexp04_release-and-version-details_table.

Screenshot 2021-12-12 at 5 40 31 PM

This data is correct! However, the newest data in this table only goes up to 2013.

So it is very possible that there is another table (e.g. release-and-version-details_table) that contains the correct information, but it is not published.

ronaldtse commented 2 years ago

@andrew2net can we close this now? Thanks.

andrew2net commented 2 years ago

@ronaldtse we still have an unresolved issue with the relation between releases and versions.

ronaldtse commented 2 years ago

I'm going to close this and leave the unresolved problem in a new issue.