Open tlongers opened 2 years ago
bump @hancush
Hi, @tlongers, sharing a relevant email from late last year where we pondered this very question together:
The revised source import loops over each row in the sources sheet. First, it creates, or retrieves and updates, an access point based on "source:access_point_id:admin". Then it creates, or retrieves and updates, the implicated source based on the combination of fields listed in Sources, below, and associates it with the access point.
If the source fields are not harmonized within records referring to the same source, then we'll see multiple versions of that source in our data. Referring back to the example in the current sheet, we have two sources for "By All Means Necessary", one with a publication date and one without, and the access points are split between those versions.
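To make the behaviour described in the email concrete, here is a minimal Python sketch of a get-or-create import loop over in-memory dictionaries. It is not the sfm-cms code. The function name import_sources, the record layout, and the columns source:title and source:publication_date are hypothetical stand-ins for the real resolving fields, and the publication date is invented for illustration; only source:access_point_id:admin comes from the email. The "update" half of "retrieve and update" is left out for brevity.

```python
import uuid

# Hypothetical stand-ins for the combination of fields that resolves a source.
SOURCE_KEY_FIELDS = ("source:title", "source:publication_date")


def import_sources(rows):
    """Get-or-create access points and sources from sheet rows (illustrative only)."""
    access_points = {}  # access point id -> access point record
    sources = {}        # tuple of resolving field values -> source record

    for row in rows:
        # Step 1: create or retrieve the access point by its id.
        ap_id = row["source:access_point_id:admin"]
        access_point = access_points.setdefault(ap_id, {"id": ap_id})

        # Step 2: create or retrieve the source by the combination of fields.
        key = tuple((row.get(name) or "").strip() for name in SOURCE_KEY_FIELDS)
        if key not in sources:
            sources[key] = {"uuid": str(uuid.uuid4()), "access_point_ids": []}
        source = sources[key]

        # Step 3: associate the access point with the source.
        access_point["source_uuid"] = source["uuid"]
        if ap_id not in source["access_point_ids"]:
            source["access_point_ids"].append(ap_id)

    return access_points, sources


# Three citations of the same document; one row is missing the (invented) date.
rows = [
    {"source:access_point_id:admin": "ap-1",
     "source:title": "By All Means Necessary",
     "source:publication_date": "2015"},
    {"source:access_point_id:admin": "ap-2",
     "source:title": "By All Means Necessary",
     "source:publication_date": ""},
    {"source:access_point_id:admin": "ap-3",
     "source:title": "By All Means Necessary",
     "source:publication_date": "2015"},
]

access_points, sources = import_sources(rows)
for key, source in sources.items():
    print(key, source["access_point_ids"])
```

Because the second row lacks a publication date, the sketch ends up with two source records for one document, and the three access points are split between them, which is the symptom described above.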
We use "source:access_point_id:admin" to create or retrieve an existing access point to relate to the source. Am I understanding you correctly that it, alone, does not uniquely identify an access point?
I agree a unique identifier for sources would be amazing! We actually have one in our data model already, so it'd be a matter of updating it (or, perhaps more easily, flushing and re-importing all sources) if/when it becomes available on your end.
Barring that, the fields we use to resolve sources are:
Aha, the Tom and Hannah of the past were wise and solved this issue already. Thanks; we'll probably implement this our side.
Wise then, shudder to think what we are now, @tlongers 😂
Thanks, we'll sort this out and let you know when we're done with it.
@hancush would you be able to do:
@tlongers This code creates the sources: https://github.com/security-force-monitor/sfm-cms/blob/master/sfm_pc/management/commands/import_country_data.py#L1389-L1433
I queried the production database and counted 11,968 unique sources.
Thanks @smcalilly
This is fixed now in the source model in source:source_id:admin. This provides a unique stable identifier for a source. What's required in sfm-cms to use these values rather than infer an identity using the alg Hannah described above?
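One possible shape for such a change, sketched in Python under stated assumptions: prefer source:source_id:admin when a row carries it, and fall back to the inferred field combination only when it is blank. This is an illustration, not the sfm-cms importer. The helper names source_key and group_rows_by_source and the fallback columns source:title and source:publication_date are hypothetical, and the sample values are invented.

```python
SOURCE_ID_FIELD = "source:source_id:admin"
# Hypothetical stand-ins for the fields currently used to infer identity.
FALLBACK_FIELDS = ("source:title", "source:publication_date")


def source_key(row):
    """Return the key that identifies the source a row refers to."""
    source_id = (row.get(SOURCE_ID_FIELD) or "").strip()
    if source_id:
        # A stable identifier is present: use it and ignore the other fields.
        return ("id", source_id)
    # No identifier: infer identity from the combination of fields.
    return ("fields", tuple((row.get(name) or "").strip() for name in FALLBACK_FIELDS))


def group_rows_by_source(rows):
    """Group sheet rows by the source they resolve to."""
    grouped = {}
    for row in rows:
        grouped.setdefault(source_key(row), []).append(row)
    return grouped


# Two rows for the same source; the fields disagree but the id does not.
rows = [
    {SOURCE_ID_FIELD: "src-1", "source:title": "Example report",
     "source:publication_date": "2015"},
    {SOURCE_ID_FIELD: "src-1", "source:title": "Example report",
     "source:publication_date": ""},
]

grouped = group_rows_by_source(rows)
print(len(grouped))  # 1
```

With the identifier as the key, unharmonized fields no longer split a source into several versions. Existing sources would still need their identifiers updated, or to be flushed and re-imported, as suggested earlier in the thread.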
@hancush In our model we assign a UUID to each source access point but not to the source itself:
Access points are citations of specific parts of a source, and we assign them a stable UUID e.g. page 54 of Source 1 has a different access point (and uuid) to page 68 of Source 1. We don't, however, assign a stable UUID to Source 1.
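As a rough illustration of that model, the Python sketch below gives every access point its own UUID while the source is carried only as descriptive text. The class name AccessPoint and its field names are hypothetical and do not reflect the actual import sheet columns.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AccessPoint:
    """A citation of a specific part of a source (hypothetical layout)."""
    source_title: str  # the source is described, but has no identifier of its own
    locator: str       # the part of the source being cited, e.g. a page
    access_point_id: str = field(default_factory=lambda: str(uuid.uuid4()))


# Two citations of the same source get two different identifiers.
page_54 = AccessPoint(source_title="Source 1", locator="page 54")
page_68 = AccessPoint(source_title="Source 1", locator="page 68")

print(page_54.access_point_id != page_68.access_point_id)
print(page_54.source_title == page_68.source_title)
```

Nothing in these records ties the two access points together except the matching descriptive text, which is why anything downstream has to infer the source's identity.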
Although sfm-cms draws uuids for access points from our import sheets, it also assigns a uuid to the source. Check here, for example, using our long neglected sfm-cms "sources" view (login required): https://back.securityforcemonitor.org/en/source/view/079ddd1a-55c4-4694-902a-f6287a2ca09b/1da0094b-02fe-4b4f-a87c-84df1414bea8/#evidence
The URL displays access point 1da0094b-02fe-4b4f-a87c-84df1414bea8, the record for which contains the following data:
However, it also assigns 079ddd1a-55c4-4694-902a-f6287a2ca09b to the source, which in this case is the document called BAHRAIN – M270 MULTIPLE LAUNCH ROCKET SYSTEMS (MLRS) UPGRADE.
How is it doing this, and does it repeat the process each time data are imported? Is there a requirement for source uniqueness inside sfm-cms that is being unmet here, and that we should fill by assigning a stable UUID to each source (and not only its access points)?