relaton / relaton-ccsds

Implement relaton-ccsds for www.ccsds.org Standards
1 stars 0 forks source link

pubid-ccsds integration #7

Closed andrew2net closed 1 month ago

andrew2net commented 4 months ago

The class parses all the CCSDS docs https://github.com/relaton/relaton-ccsds/blob/main/lib/relaton_ccsds/data_parser.rb and save the in the https://github.com/relaton/relaton-ccsds/blob/main/lib/relaton_ccsds/data_fetcher.rb all IDs are in the index file https://github.com/relaton/relaton-data-ccsds/blob/main/index-v1.yaml

We need to parse all the IDs in the document parser and generate a new index with IDs as Hash instead of String. For example, the current index contains:

- :id: CCSDS 230.2-G-1
  :file: data/CCSDS-230-2-G-1.yaml
  ...

The new index should be like:

- :id: 
    :publisher: CCSDS
    :number: 230
    :part: 2
    :book: G-1
  :file: data/CCSDS-230-2-G-1.yaml
  ...

These ID parts are an example. I don't know what output the Pubid::Ccsds::Identifier.parse(id).to_h produces. So whatever Hash object it produces, we need to use it as an id value.

When the new index is created we need to update searching documents to use the new index.

andrew2net commented 3 months ago

The DocumentIdentifier#id returned String before and it returns Hash now. The GHA fails for this reason. We shouldn't change existed methods` behavior. @mico we need to rollback the changes made here as soon as possible.

That you need to do here is:

  1. Add new index. We already have an index and you shouldn't change it. We should to keep the index for a while for previous relaton-ccsds releases: https://github.com/relaton/relaton-ccsds/blob/0015be2ba88e6fdb170873962b618048ecada84d/lib/relaton_ccsds/data_fetcher.rb#L41-L43 so you need to add index-v2.yaml
  2. Generate the new index. Currently id is used as a key in index https://github.com/relaton/relaton-ccsds/blob/0015be2ba88e6fdb170873962b618048ecada84d/lib/relaton_ccsds/data_fetcher.rb#L120 For the new index you need to transform the id String -> Hash using pubid-ccsd, then use it as a key.
  3. Update search. https://github.com/relaton/relaton-ccsds/blob/0015be2ba88e6fdb170873962b618048ecada84d/lib/relaton_ccsds/hit_collection.rb#L12 The index.search text just compare text with string id in the index. We need to replace it with something like:
    pubid_ref = PubidCcsds::Identifier.parse(text)
    rows = index.search do |row|
    pubid = PubidCcsds::Identifier.new(row[:id])
    pubid == pubid_ref
    end

    The goal is to get best result with incomplete references. For example there are two editions CCSDS 103.0-B-1 and CCSDS 103.0-B-2. If we use reference without edition CCSDS 103.0-B we should get two rows form the index and then pick up the last one.

mico commented 3 months ago

@andrew2net @ronaldtse

pubid_ref = PubidCcsds::Identifier.parse(text)
rows = index.search do |row|
  pubid = PubidCcsds::Identifier.new(row[:id])
  pubid == pubid_ref
end

I believe search should be implemented in Relaton::Index::Type One case might be an adding functionality to use custom search function, for example implementing Relaton::Index::Type#set_custom_search_function(custom_search_function) method, so it will make it possible to assign a custom search method to index assigned to provider:

index = Relaton::Index.find_or_create :ccsds, url: "#{GHURL}index-v1.zip", file: INDEX_FILE
index.set_custom_search_function(
  -> (row) { PubidCcsds::Identifier.new(row[:id]) ==  PubidCcsds::Identifier.parse(text) }
)

Another case is implementation Relaton::Index::Type type for each identifiers provider (CCSDS in our case) where we implement separate search function.

Another idea is to implement pubid usage in Relaton::Index::Type by default, because we are planning use pubid everywhere in the future.

andrew2net commented 3 months ago

@mico what are the benefits of using set_custom_search_function over search with block?

ronaldtse commented 3 months ago

Using a block with search achieves exactly the same effect, right?

index = Relaton::Index.find_or_create :ccsds, url: "#{GHURL}index-v1.zip", file: INDEX_FILE
index.search do |row|
  PubidCcsds::Identifier.new(row[:id]) ==  PubidCcsds::Identifier.parse(text)
end

(This really is just the Enumerable#detect/select methods?)

andrew2net commented 3 months ago

@ronaldtse We need to use block because we need to create Pubid object inside the block. The relaton-index doesn't know what type of Pubid we need to create, and I don't think we should implement this logic in it. Also in other Relaton flafor gem we need to exclude some ID elements from comparation like edition, part etc. For example when we need to get all parts doc we should ignore parts in the block.

mico commented 3 months ago

Using a block with search achieves exactly the same effect, right?

@ronaldtse Yes, but do you think it's ok to use block every time when I just need to search by identifier's string (most common case):

index.search { |row| PubidCcsds::Identifier.new(row[:id]) == PubidCcsds::Identifier.parse(text) }

Instead of just:

index.search(text) }

I'm fine with that, just trying to cover obvious scenarios without code duplication when possible.

andrew2net commented 3 months ago

@ronaldtse Yes, but do you think it's ok to use block every time when I just need to search by identifier's string (most common case):

@mico Usually, we use search once within a gem, in cases where it is used more than once we need to exclude some parts of ID form comparison. So we need to use a block anyway.

ronaldtse commented 3 months ago

Actually I believe that having a simpler method for people to use the index is a good thing. Let’s allow a text search and if people want to do complex comparisons they can use a block?

ronaldtse commented 3 months ago

BTW, a search by default means the results returned is in an array.

andrew2net commented 3 months ago

Actually I believe that having a simpler method for people to use the index is a good thing. Let’s allow a text search and if people want to do complex comparisons they can use a block?

I agree, but it's not related to this issue. We need to use block in relaton-ccsds because we need to disable edition if a given reference doesn't have an edition, so we need to find all editions of the document.

andrew2net commented 2 months ago

@mico why can't we just use search with block here?