Data format - Githubissues

paucodeici / universal_isic

GNU General Public License v3.0


Data format #1

Open paucodeici opened 8 months ago

paucodeici commented 8 months ago

We need to decide on a data format to apply to each classification.

Industrial classifications all work as trees.

So we should implement a tree format, with as many branches and leaves as we want.

Such a data structure is often recursive (since we do not know the depth at construction) and needs a Node class.

A node is simply:

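from __future__ import annotations  # lets Node refer to itself in annotations
from dataclasses import dataclass

@dataclass
class Node:
    code: str
    description: str
    parent: Node | None = None  # None for the root of the tree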

The searches people could run are:

1. search for a code to get its description
2. the contrary: search for a description to get its code
3. get the list of all descendants of an upper-level node, down to a chosen depth
4. get the upper level of a description/code, up a chosen number of levels
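A minimal sketch of these four searches, assuming each node also keeps a `children` list (not part of the Node above) and that a plain tree walk is fast enough. The function names and the sample entries are only illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass(eq=False)  # eq=False: nodes are compared by identity
class Node:
    code: str
    description: str
    parent: Node | None = field(default=None, repr=False)
    # Assumption: a children list, added here to make descendant searches easy.
    children: list[Node] = field(default_factory=list)

    def add_child(self, code: str, description: str) -> Node:
        child = Node(code, description, parent=self)
        self.children.append(child)
        return child


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and everything below it, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find_by_code(root: Node, code: str) -> Node | None:
    """Search 1: code -> node (and so its description)."""
    return next((n for n in walk(root) if n.code == code), None)


def find_by_description(root: Node, description: str) -> Node | None:
    """Search 2: description -> node (and so its code)."""
    return next((n for n in walk(root) if n.description == description), None)


def descendants(node: Node, depth: int) -> list[Node]:
    """Search 3: the nodes exactly `depth` levels below `node`."""
    if depth == 0:
        return [node]
    return [d for child in node.children for d in descendants(child, depth - 1)]


def ancestor(node: Node, levels: int) -> Node | None:
    """Search 4: the node `levels` steps above `node`, or None past the root."""
    current: Node | None = node
    for _ in range(levels):
        if current is None:
            return None
        current = current.parent
    return current


# Illustrative entries only.
root = Node("A", "Agriculture, forestry and fishing")
division = root.add_child("01", "Crop and animal production")
group_1 = division.add_child("011", "Growing of non-perennial crops")
group_2 = division.add_child("012", "Growing of perennial crops")
```

With this layout `descendants(root, 2)` returns the two groups and `ancestor(group_1, 2)` returns the root. Searches 1 and 2 scan the whole tree, which seems acceptable while the classifications stay small.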

Other important elements are the links between classifications.

For mapping from tree A to tree B we map each node of tree A to nodes of tree B. Three situations can occur:

1. A node in A has no mapping to any node in B
2. A node in A maps to exactly one node in B, and vice versa
3. A node in A can be mapped to many nodes in B

For generality we prefer to treat 1 and 3 as the only options: situation 2 can be stored as situation 3 with a single target node.

This leads to a mapping stored as a dict, where

dict[Node in A] = [Node 1 in B, Node 2 in B, ..., Node N in B]

This gives the mapping from A to B.
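A small sketch of that dict, assuming nodes are hashable (a frozen dataclass here) so they can be used as keys. The node codes and the `invert` helper are hypothetical and only there to show the shape of the data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)  # frozen makes the node hashable, so it can be a dict key
class Node:
    code: str
    description: str


# Hypothetical nodes from two classifications, A and B.
a1 = Node("A1", "first node of A")
a2 = Node("A2", "second node of A")
a3 = Node("A3", "third node of A")
b1 = Node("B1", "first node of B")
b2 = Node("B2", "second node of B")
b3 = Node("B3", "third node of B")

# Mapping from A to B: every node of A points to a list of nodes of B.
mapping: dict[Node, list[Node]] = {
    a1: [],        # no counterpart in B
    a2: [b1],      # one-to-one, stored as a one-element list
    a3: [b2, b3],  # one-to-many
}


def invert(mapping: dict[Node, list[Node]]) -> dict[Node, list[Node]]:
    """Build the B -> A direction from an A -> B mapping."""
    inverse: dict[Node, list[Node]] = {}
    for source, targets in mapping.items():
        for target in targets:
            inverse.setdefault(target, []).append(source)
    return inverse
```

An empty list covers the "no mapping" case, so every lookup returns a list and callers do not need a special branch.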

Another way is to create edges, such as

edge: node_1: node_2:
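The edge idea could look roughly like this, assuming an edge only stores the codes of its two endpoints. The `Edge` type, the helper names, and the codes are hypothetical.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    node_1: str  # code of the node in classification A
    node_2: str  # code of the node in classification B


# One edge per link between the two classifications.
edges = [
    Edge("A2", "B1"),
    Edge("A3", "B2"),
    Edge("A3", "B3"),
]


def targets(edges: List[Edge], code: str) -> List[str]:
    """Codes in B linked to the given code in A."""
    return [e.node_2 for e in edges if e.node_1 == code]


def sources(edges: List[Edge], code: str) -> List[str]:
    """Codes in A linked to the given code in B."""
    return [e.node_1 for e in edges if e.node_2 == code]
```

Compared with the dict, a flat edge list can be read in both directions without building an inverse, at the cost of a scan on every lookup.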

paucodeici commented 8 months ago

For the tree, we would need:

For searches 1 and 2: a list of (code, value) tuples (not as efficient as a dict, but we do not care...) => can we just use the list of Nodes?
For search 3: in Node1 we put children to construct the descendants
For search 4: a list of Node2 with parents

That would mean two data structures representing the same thing.

Or we just keep each node at its level, to have a structure like:

level_0 = [Node_0]
level_1 = [Node_1, Node_2, ...]
...

When looking for the upper level of an element, this lets us simply loop through the nodes in the level above it.
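A sketch of the level-by-level layout, assuming nodes only store their children and the per-level lists are what let us climb back up. `build_levels`, `upper_level`, and the sample entries are illustrative names and data, not a settled design.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False: `in` checks below compare nodes by identity
class Node:
    code: str
    description: str
    children: list[Node] = field(default_factory=list)


def build_levels(root: Node) -> list[list[Node]]:
    """Group nodes by depth: levels[0] = [root], levels[1] = its children, ..."""
    levels = [[root]]
    while True:
        below = [child for node in levels[-1] for child in node.children]
        if not below:
            return levels
        levels.append(below)


def upper_level(levels: list[list[Node]], node: Node, steps: int = 1) -> Node | None:
    """Climb `steps` levels by scanning each level above for the node's parent.

    Returns None when the climb goes past the root. Assumes `node` is in `levels`.
    """
    depth = next(d for d, nodes in enumerate(levels) if node in nodes)
    current = node
    for d in range(depth - 1, depth - 1 - steps, -1):
        if d < 0:
            return None
        # Loop through the level above and keep the node that owns `current`.
        current = next(c for c in levels[d] if current in c.children)
    return current


# Illustrative entries only.
group_011 = Node("011", "Growing of non-perennial crops")
group_012 = Node("012", "Growing of perennial crops")
group_021 = Node("021", "Silviculture and other forestry activities")
division_01 = Node("01", "Crop and animal production", [group_011, group_012])
division_02 = Node("02", "Forestry and logging", [group_021])
root = Node("A", "Agriculture, forestry and fishing", [division_01, division_02])

levels = build_levels(root)
```

Here `upper_level(levels, group_021)` scans `levels[1]` and returns `division_02`. Each step costs one pass over a level, which fits the "optimize later" stance.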

And honestly, if optimization is a problem it can be tackled later...