relaton / support

Internal repository support for Relaton

relaton-data-* v2 data #32

Closed CAMOBAP closed 3 months ago

CAMOBAP commented 3 months ago

Per @andrew2net

We need to version the relaton-data-* repos. The upcoming Relaton version has a significant update in the data model, so the new release will create files incompatible with the previous one. But we need to support the old version for a while. Can we create branches in the relaton-data-* repos for the new file format and run GHA only for the latest version?

Open questions

  1. Will crawler.rb stay the same?
  2. Maybe this is already implemented in some repo that could serve as a reference?

Proposed solution

Because the schedule event is not fired for non-default branches, the only option is to produce both v1 and v2 data (zip/yaml) in the default (main) branch.

andrew2net commented 3 months ago

Open questions

  1. Will crawler.rb stay the same?

It will stay the same.

  2. Maybe this is already implemented in some repo that could serve as a reference?

It's not implemented yet.

Proposed solution

Because the schedule event is not fired for non-default branches, the only option is to produce both v1 and v2 data (zip/yaml) in the default (main) branch.

@CAMOBAP The problem is that we need to keep the previous Relaton versions working for a while (until all users can update to the latest version). So the current data dir and index files should stay where they are. But for the new Relaton version, we need a different dataset. So if it's not possible to use a schedule on a non-default branch, we have to put the other dataset in a different dir, like 1.18.1. We also need to run the GHA for the latest dataset version only. Does it make sense?

CAMOBAP commented 3 months ago

@andrew2net thanks for the answers. Could you share a code snippet for some flavor that supports the new format (let's call it v2) and shows how to fetch the v2 dataset version?

andrew2net commented 3 months ago

@CAMOBAP Relaton doesn't support versioned datasets at this moment. There is an upcoming data-model update, so the new Relaton release will use datasets incompatible with the current release. The current release fetches documents from GH using raw links. For example, to get an IEC index file the link https://github.com/relaton/relaton-data-iec/raw/main/index1.zip is used. To get a document file the link https://github.com/relaton/relaton-data-iec/raw/main/data/cispr_10_1971.yaml is used. The current snippet is:

...
GHURL = "https://raw.githubusercontent.com/relaton/relaton-data-iec/main/"
...
url = "#{GHURL}#{hit[:file]}"
resp = Net::HTTP.get URI(url)
...

The new Relaton release could set GHURL = "https://raw.githubusercontent.com/relaton/relaton-data-iec/main/v1.19/" to fetch files in the new format. We are going to number dataset versions after the Relaton minor version that introduces the new format. For example, the next version of Relaton will be 1.19.0, so the new dataset version will be v1.19.
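A minimal sketch of how the versioned base URL could be derived, assuming the layout described above: legacy files stay at the repository root and new-format files live under a directory named after the Relaton minor version. The helper names dataset_version, dataset_base_url and file_url are hypothetical and not part of any Relaton gem, and it is an assumption that file names such as index1.zip stay the same inside the versioned directory.

```ruby
require "rubygems" # provides Gem::Version (loaded by default in modern Ruby)

# Base raw URL taken from the current snippet.
REPO_RAW = "https://raw.githubusercontent.com/relaton/relaton-data-iec/main/"

# Hypothetical helper: "1.19.0" -> "v1.19" (major.minor only).
def dataset_version(relaton_version)
  major, minor = Gem::Version.new(relaton_version).segments
  "v#{major}.#{minor}"
end

# Hypothetical helper: nil means the legacy layout at the repository root.
def dataset_base_url(relaton_version = nil)
  return REPO_RAW if relaton_version.nil?

  "#{REPO_RAW}#{dataset_version(relaton_version)}/"
end

# Hypothetical helper: full URL for an index or document file.
def file_url(file, relaton_version = nil)
  "#{dataset_base_url(relaton_version)}#{file}"
end

puts file_url("data/cispr_10_1971.yaml")           # legacy location
puts file_url("data/cispr_10_1971.yaml", "1.19.0") # new-format location
```

With this shape, an old release keeps resolving files at the root while a 1.19.x release resolves them under v1.19/, so both can be served from the default branch.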

CAMOBAP commented 3 months ago

So, to maintain the old dataset format, we will call fetch on a frozen version of relaton-* (because we don't use relaton or relaton-cli directly to fetch, only the specific relaton-* gems).

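A few more questions:

  • Once a relaton-* gem that supports the new format is released, should we create new dataset files for future minor version changes, i.e. v1.19, v1.20, etc.?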

For now, it looks to me like we should do it at the crawler.rb level, something like:

--- a/crawler.rb
+++ b/crawler.rb
@@ -1,3 +1,5 @@
 # frozen_string_literal: true
+require 'rubygems'
+

 system("sudo apt-get install mdbtools")
@@ -7,3 +9,11 @@ mode = mode == "force" ? "-#{mode}" : ""

 require "relaton_3gpp"
 Relaton3gpp::DataFetcher.fetch("status-smg-3GPP#{mode}")
+
+# Update legacy data format
+Gem::Uninstaller.new("relaton_3gpp").uninstall
+Gem.clear_paths
+Gem::install("relaton_3gpp", "1.18")
+
+require "relaton_3gpp"
+Relaton3gpp::DataFetcher.fetch("status-smg-3GPP#{mode}")

Just a general idea; there should also be some logic to save the two data formats to separate dirs.
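One caveat with swapping gem versions inside a single process: a second require of a feature that is already loaded is normally a no-op, and the constants defined by the first version stay in memory. A possible alternative is to run each fetch in its own Ruby process with the gem version pinned through Kernel#gem. The sketch below only builds the command; pinned_fetch_command is a hypothetical helper, and the "~> 1.18.0" requirement is an assumption about how the legacy gem versions are numbered.

```ruby
require "rbconfig"

# Hypothetical helper: builds the argv for a child Ruby process that
# activates one pinned version of a fetcher gem before requiring it.
def pinned_fetch_command(gem_name, requirement, fetch_call)
  script = [
    "gem #{gem_name.inspect}, #{requirement.inspect}", # Kernel#gem pins the version
    "require #{gem_name.inspect}",
    fetch_call
  ].join("; ")
  [RbConfig.ruby, "-e", script]
end

cmd = pinned_fetch_command(
  "relaton_3gpp",
  "~> 1.18.0", # assumed requirement for the legacy format
  'Relaton3gpp::DataFetcher.fetch("status-smg-3GPP")'
)
puts cmd.last
# system(*cmd) would run the legacy fetch in isolation from the current process.
```

Both gem versions would need to be installed side by side for this to work, which RubyGems allows.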

@andrew2net any feedback/objections?

andrew2net commented 3 months ago

So, to maintain the old dataset format, we will call fetch on a frozen version of relaton-* (because we don't use relaton or relaton-cli directly to fetch, only the specific relaton-* gems).

We don't need to update the old datasets, only the last one.

A few more questions:

  • Once a relaton-* gem that supports the new format is released, should we create new dataset files for future minor version changes, i.e. v1.19, v1.20, etc.?

I will create a new dataset when needed. It won't be too often. We need to find the best way to synchronize creating a new dataset with releasing Relaton. The solution should prevent using a dataset with the wrong version of Relaton. I think we can set a dataset's dir in the crawler.yml and a gem version in the Gemfile.
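A rough sketch of one way to keep a gem from reading the wrong dataset, assuming dataset dirs are named like v1.19 and that a dataset stays valid until a newer one is created. The helper dataset_dir_for and the list of dirs are hypothetical; in practice the dir might come from crawler.yml and the gem version from the Gemfile.

```ruby
require "rubygems" # provides Gem::Version

# Hypothetical helper: picks the newest dataset dir whose version is not
# greater than the running gem's version. Returns nil when no versioned
# dataset applies, meaning the legacy layout at the repository root.
def dataset_dir_for(gem_version, dataset_dirs)
  gem_v = Gem::Version.new(gem_version)
  dataset_dirs
    .map { |dir| [Gem::Version.new(dir.delete_prefix("v")), dir] }
    .select { |version, _dir| version <= gem_v }
    .max_by { |version, _dir| version }
    &.last
end

dirs = ["v1.19", "v1.22"] # example values only
p dataset_dir_for("1.18.1", dirs)
p dataset_dir_for("1.20.3", dirs)
p dataset_dir_for("1.22.0", dirs)
```

Because new datasets are not expected often, selecting by "newest dataset not newer than the gem" avoids creating a dir for every minor release.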

CAMOBAP commented 3 months ago

I will create a new dataset when needed. It won't be too often. We need to find the best way to synchronize creating a new dataset with releasing Relaton.

From my side, I just need an API to:

The solution should prevent using a dataset with the wrong version of Relaton

To be honest, I'm completely unaware of the dataset workflow: what happens to a dataset once it has been fetched by the crawler and committed to the corresponding relaton-data-* repo?

I think we can set a dataset's dir in the crawler.yml and a gem version in the Gemfile

It would be nice to keep this logic in a single place rather than duplicating it across crawler.rb files. Maybe the fetch API should create the proper dir structure, including the version?

andrew2net commented 3 months ago

@CAMOBAP I just realized that if we don't use branches to version the datasets, then GHA doesn't need any update. I can handle this issue in crawler.rb and the Relaton gems.