security-force-monitor / sfm-cms

Platform for sharing complex information about security forces. Powers WhoWasInCommand.com
https://whowasincommand.com

[question] assignment of stable identifier for a source (and not only its access points) #809

Open tlongers opened 2 years ago

tlongers commented 2 years ago

@hancush In our model we assign a UUID to each source access point, but not to the source itself:

Source 1 -> Access point A (e38161cb-f0f2-4fa6-94ff-2c99f96225ea)
Source 1 -> Access point B (1c0fc3ea-b5fa-4fcf-acb7-0f8cb6b3b829)
Source 1 -> Access point C (bf0b493d-6b56-4347-8935-8be63cc44fe3)

Access points are citations of specific parts of a source, and each one gets a stable UUID: for example, page 54 of Source 1 has a different access point (and UUID) from page 68 of Source 1. We don't, however, assign a stable UUID to Source 1 itself.

Although sfm-cms draws UUIDs for access points from our import sheets, it also assigns a UUID to the source. Check here, for example, using our long-neglected "sources" view in sfm-cms (login required):

https://back.securityforcemonitor.org/en/source/view/079ddd1a-55c4-4694-902a-f6287a2ca09b/1da0094b-02fe-4b4f-a87c-84df1414bea8/#evidence

The URL displays access point 1da0094b-02fe-4b4f-a87c-84df1414bea8, the record for which contains the following data:

| field | value |
| --- | --- |
| source:comments:admin | |
| source:status:admin | 3 |
| source:external_archive_sha_content:admin | |
| source:external_archive_sha_meta:admin | |
| source:access_point_id:admin | 1da0094b-02fe-4b4f-a87c-84df1414bea8 |
| source:type | document |
| source:title | BAHRAIN – M270 MULTIPLE LAUNCH ROCKET SYSTEMS (MLRS) UPGRADE |
| source:author | |
| source:url | https://www.dsca.mil/press-media/major-arms-sales/bahrain-m270-multiple-launch-rocket-systems-mlrs-upgrade |
| source:created_timestamp | |
| source:uploaded_timestamp | |
| source:published_timestamp | 2022-03-24 |
| source:accessed_timestamp | 2022-03-30 |
| source:access_point_type | archive |
| source:access_point_trigger | |
| source:archive_url | https://web.archive.org/web/20220327065604/https://www.dsca.mil/press-media/major-arms-sales/bahrain-m270-multiple-launch-rocket-systems-mlrs-upgrade |
| source:archive_timestamp | |
| source:publication_country | us |
| source:publication_name | Defense Security Cooperation Agency |
| source:publication_id:admin | 91eda787-dacd-41ee-93c0-e01e3120a28b |

However, sfm-cms also assigns 079ddd1a-55c4-4694-902a-f6287a2ca09b to the source, which in this case is the document called BAHRAIN – M270 MULTIPLE LAUNCH ROCKET SYSTEMS (MLRS) UPGRADE.

How is it doing this, and does it repeat the process each time data are imported? Is there a requirement for source uniqueness inside sfm-cms that is going unmet here, and that we should meet by assigning a stable UUID to each source (and not only its access points)?

tlongers commented 2 years ago

bump @hancush

hancush commented 2 years ago

Hi, @tlongers, sharing a relevant email from late last year where we pondered this very question together:


Source import flow

The revised source import loops over each row in the sources sheet. First, it creates an access point, or retrieves and updates an existing one, based on "source:access_point_id:admin". Then, it creates, or retrieves and updates, the implicated source based on the combination of fields listed in Sources, below, and associates it with the access point.

If the source fields are not harmonized within records referring to the same source, then we'll see multiple versions of that source in our data. Referring back to the example in the current sheet, we have two sources for "By All Means Necessary", one with a publication date and one without, and the access points are split between those versions.
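The flow described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual sfm-cms importer: the function `import_sources`, the tuple `SOURCE_KEY_FIELDS`, the choice of key fields, the use of a random UUID for new sources, and the sample rows (including the date) are all assumptions made for the example. It shows how two rows that refer to the same source but disagree on one field end up as two separate sources.

```python
import uuid

# ASSUMPTION: the real list of fields used to resolve a source is not given
# in this thread; these three are picked only to illustrate the idea.
SOURCE_KEY_FIELDS = (
    "source:title",
    "source:published_timestamp",
    "source:publication_id:admin",
)


def import_sources(rows):
    """Hypothetical sketch of the two-step import described above."""
    access_points = {}  # access point id -> access point record
    sources = {}        # tuple of key field values -> source record

    for row in rows:
        # Step 1: create, or retrieve and update, the access point by its id.
        ap_id = row["source:access_point_id:admin"]
        access_point = access_points.setdefault(ap_id, {})
        access_point.update(row)

        # Step 2: create or retrieve the source from the combination of fields.
        key = tuple(row.get(field, "") for field in SOURCE_KEY_FIELDS)
        if key not in sources:
            # ASSUMPTION: a fresh random UUID is minted for each new source.
            sources[key] = {"uuid": str(uuid.uuid4()), "access_points": set()}
        source = sources[key]

        # Associate the source with the access point.
        source["access_points"].add(ap_id)
        access_point["source_uuid"] = source["uuid"]

    return sources, access_points


# Illustrative rows only: same title, but one row lacks a publication date.
rows = [
    {"source:access_point_id:admin": "ap-1",
     "source:title": "By All Means Necessary",
     "source:published_timestamp": "2014-01-01"},
    {"source:access_point_id:admin": "ap-2",
     "source:title": "By All Means Necessary"},
]
sources, access_points = import_sources(rows)
print(len(sources))  # 2
```

Because the source UUID in this sketch is derived from nothing stable, flushing and re-importing would mint new source UUIDs each time, which is one reason a source identifier supplied in the sheet is attractive.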

Access points

We use "source:access_point_id:admin" to create or retrieve an existing access point to relate to the source. Am I understanding you correctly that it, alone, does not uniquely identify an access point?

Sources

I agree a unique identifier for sources would be amazing! We actually have one in our data model already, so it'd be a matter of updating it (or, perhaps more easily, flushing and re-importing all sources) if/when it becomes available on your end.

Barring that, the fields we use to resolve sources are:

tlongers commented 2 years ago

Aha, the Tom and Hannah of the past were wise and solved this issue already. Thanks; we'll probably implement this our side.

hancush commented 2 years ago

Wise then, shudder to think what we are now, @tlongers 😂

tlongers commented 2 years ago

Thanks, we'll sort this out and let you know when we're done with it.

tlongers commented 2 years ago

@hancush would you be able to do:

smcalilly commented 2 years ago

@tlongers This code creates the sources: https://github.com/security-force-monitor/sfm-cms/blob/master/sfm_pc/management/commands/import_country_data.py#L1389-L1433

I queried the production database and counted 11,968 unique sources.

tlongers commented 2 years ago

Thanks @smcalilly

tlongers commented 1 year ago

This is fixed now in the source model: source:source_id:admin provides a unique, stable identifier for a source. What's required in sfm-cms to use these values rather than infer an identity using the algorithm Hannah described above?
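One possible shape for that change, offered only as a sketch and not as how sfm-cms does or should do it: prefer the value in source:source_id:admin when a row carries one, and fall back to field-based resolution otherwise. The function `resolve_source_uuid`, its parameters, and the fallback behaviour are hypothetical; the key fields passed in below are placeholders, since the real list is not given in this thread.

```python
import uuid


def resolve_source_uuid(row, known_sources, key_fields):
    """Hypothetical helper: pick the UUID to use for the source in `row`.

    known_sources maps a tuple of field values to a previously minted UUID
    and is only consulted when the row carries no explicit source id.
    """
    explicit = (row.get("source:source_id:admin") or "").strip()
    if explicit:
        # uuid.UUID raises ValueError on malformed input, so a bad id in the
        # sheet fails loudly instead of silently creating a new source.
        return str(uuid.UUID(explicit))

    # Fallback: infer identity from the combination of fields, as before.
    key = tuple((row.get(field) or "").strip() for field in key_fields)
    if key not in known_sources:
        known_sources[key] = str(uuid.uuid4())
    return known_sources[key]


# Two rows that disagree on the publication date but share a source id
# resolve to the same source.
known = {}
fields = ("source:title", "source:published_timestamp")  # placeholder list
row_a = {"source:source_id:admin": "079ddd1a-55c4-4694-902a-f6287a2ca09b",
         "source:title": "Example", "source:published_timestamp": "2022-03-24"}
row_b = {"source:source_id:admin": "079ddd1a-55c4-4694-902a-f6287a2ca09b",
         "source:title": "Example"}
same = resolve_source_uuid(row_a, known, fields) == resolve_source_uuid(row_b, known, fields)
print(same)  # True
```

With an explicit identifier, unharmonized fields no longer split a source into several versions; they only affect which field values end up stored on it.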