spechub / Hets

The Heterogeneous Tool Set
http://hets.eu
GNU General Public License v2.0

in db, documents are duplicated #1844

Open tillmo opened 6 years ago

tillmo commented 6 years ago

If file:///home/till/temp/doc1.dol contains

spec sp = sort s end

and file:///home/till/temp/doc2.dol contains

from <file:///home/till/temp/doc1.dol> get sp
spec sp1 = sp then sort t end

and I analyse the two files with Hets in that order, then I get

select * from documents;
1 doc1                             doc1                               file:///home/till/temp/doc1.dol            
4 file:///home/till/temp/doc1.dol  <file:///home/till/temp/doc1.dol>  file:///home/till/temp/doc1.dol            
7 doc2                             doc2                               file:///home/till/temp/doc2.dol   

Could this duplication be avoided by using location to identify documents, even if name and display_name differ?
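
To make the duplication concrete, here is a minimal sketch that loads the three rows above into an in-memory SQLite table and groups them by location. The column order (id, display_name, name, location) is taken from the query output later in this thread; the use of SQLite and the simplified table definition are assumptions for illustration, not the actual Hets schema.

```python
import sqlite3

# Simplified stand-in for the documents table (assumption: only these four columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, display_name TEXT, name TEXT, location TEXT)"
)

# The three rows reported in the issue.
rows = [
    (1, "doc1", "doc1", "file:///home/till/temp/doc1.dol"),
    (4, "file:///home/till/temp/doc1.dol", "<file:///home/till/temp/doc1.dol>",
     "file:///home/till/temp/doc1.dol"),
    (7, "doc2", "doc2", "file:///home/till/temp/doc2.dol"),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", rows)

# Locations that occur in more than one row, regardless of name or display_name.
duplicates = conn.execute(
    "SELECT location, COUNT(*) FROM documents "
    "GROUP BY location HAVING COUNT(*) > 1"
).fetchall()
print(duplicates)
```

Grouping by location reports doc1.dol twice even though name and display_name differ between the two rows, which is the identification the question proposes.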

eugenk commented 6 years ago

This doesn't happen in Ontohub. Let me explain the current behaviour:

Without Ontohub (no --database-fileversion-id parameter provided)

When you call Hets to analyse a file, it creates a new row in the file_versions table. The file_versions.id field of the new row is used to associate a Document with a FileVersion (that's a database constraint of Ontohub). Since the FileVersions of your consecutive calls to Hets differ, there are multiple documents with the same location/LocId. See this SQL output:

 ❯ psql -U postgres -d hets_development -c 'SELECT * FROM documents AS sub INNER JOIN loc_id_bases ON sub.id = loc_id_bases.id INNER JOIN file_versions ON loc_id_bases.file_version_id = file_versions.id;'
 id |     display_name     |          name          |       location       | version | id | file_version_id |  kind   |  loc_id  | id | action_id | repository_id |         path         |     commit_sha
----+----------------------+------------------------+----------------------+---------+----+-----------------+---------+----------+----+-----------+---------------+----------------------+---------------------
  1 | doc1                 | doc1                   | file:///tmp/doc1.dol |         |  1 |               1 | Library | doc1.dol |  1 |         1 |             1 | file:///tmp/doc1.dol | non-git FileVersion
  4 | file:///tmp/doc1.dol | <file:///tmp/doc1.dol> | file:///tmp/doc1.dol |         |  4 |               2 | Library | doc1.dol |  2 |         3 |             1 | file:///tmp/doc2.dol | non-git FileVersion
  7 | doc2                 | doc2                   | file:///tmp/doc2.dol |         |  7 |               2 | Library | doc2.dol |  2 |         3 |             1 | file:///tmp/doc2.dol | non-git FileVersion

The Document with documents.id = 1 is associated with file_versions.id = 1, while the other two documents are associated with file_versions.id = 2. Please note that the --database-reanalyze option has no effect if --database-fileversion-id is not set, because Hets always creates a new FileVersion and there is no data that can be overwritten for that new FileVersion.
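
The behaviour described above can be sketched as a toy model in Python. It is not Hets code: the `analyse` function and the in-memory `documents` list are hypothetical names, and the only assumption carried over from the explanation is that each call creates a fresh FileVersion and that a document is recorded once per FileVersion.

```python
from itertools import count

# Hypothetical stand-ins for the file_versions and documents tables.
file_version_ids = count(1)
documents = []  # list of (file_version_id, location)

def analyse(locations):
    """Model one Hets call without --database-fileversion-id.

    A new FileVersion is created on every call, so a location that was
    stored under an earlier FileVersion is stored again under the new one.
    """
    file_version_id = next(file_version_ids)
    for location in locations:
        if (file_version_id, location) not in documents:
            documents.append((file_version_id, location))
    return file_version_id

# First call: doc1 alone. Second call: doc2, which imports doc1.
analyse(["file:///tmp/doc1.dol"])
analyse(["file:///tmp/doc1.dol", "file:///tmp/doc2.dol"])

for row in documents:
    print(row)
```

In this model doc1.dol ends up stored under both FileVersions, mirroring rows 1 and 4 of the query output.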

With Ontohub (--database-fileversion-id=123456789 parameter provided)

Ontohub always tells Hets which FileVersion to use for the association with Documents and sub-Document models. If the same FileVersion is used for consecutive calls on the same Document, Hets fails unless the --database-reanalyze option is set. With this option, Hets deletes all data that is (recursively) associated with the FileVersion and saves the new analysis result for the given FileVersion.
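
As a rough illustration of the rule just described, the following Python sketch models a store keyed by FileVersion. The names `save_analysis`, `AlreadyAnalysed` and `store` are hypothetical and only mirror the described semantics of --database-reanalyze; they do not correspond to anything in the Hets code base.

```python
class AlreadyAnalysed(Exception):
    """Raised when a FileVersion already has analysis data (hypothetical)."""

# Hypothetical store: file_version_id -> locations of the saved documents.
store = {}

def save_analysis(file_version_id, locations, reanalyze=False):
    """Save an analysis result for a FileVersion chosen by the caller."""
    if file_version_id in store:
        if not reanalyze:
            # Same FileVersion used again without the reanalyze option: fail.
            raise AlreadyAnalysed(f"FileVersion {file_version_id} has data")
        # With reanalyze: drop everything associated with this FileVersion.
        del store[file_version_id]
    store[file_version_id] = list(locations)

save_analysis(123456789, ["file:///tmp/doc1.dol"])
try:
    save_analysis(123456789, ["file:///tmp/doc1.dol"])
except AlreadyAnalysed as err:
    print("failed:", err)
save_analysis(123456789, ["file:///tmp/doc1.dol"], reanalyze=True)
```

Because the caller fixes the FileVersion, a repeated analysis either fails or replaces the earlier data, so this mode does not accumulate duplicate documents for one FileVersion.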