This will likely be a long-lived (1-2 weeks) and uncharacteristically large PR. The project is at a state where nearly all pieces are working, but there is not a cohesive model and process for bringing those pieces together. I intend to do that in this PR. This will require several database migrations, code refactors, and UI updates.
In order to not cause conflicts on the `dev` branch while I make such migrations and refactors, I plan on keeping changes here until the full integration is updated and ready. I'll keep a rough task list at the bottom of this to keep folks apprised of progress, but it's just my notes and is subject to change - it's not an exhaustive or exact set of requirements/features. I'll also include descriptions of larger changes and refactors - those will likely be easier to follow than the large number of file changes.
## Schemas
Schemas are currently a single field on the `Collection` model. In order to keep track of the history of schema edits, they need to be their own model. I'll be creating a `Schema` table and connecting its items to collections.
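For illustration, here's a minimal sketch of what that table might hold (field names are my assumptions and will likely shift during implementation):

```typescript
// Hypothetical sketch of the new Schema model - field names are assumptions.
// Each saved edit becomes a new Schema row, so a collection's schema history
// is just its Schema rows ordered by createdAt.
interface Schema {
  id: string;
  collectionId: string; // FK to the Collection this schema version belongs to
  content: string;      // the schema definition previously stored on Collection
  createdAt: Date;
}
```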
## Uploads vs Publishing
One thing I noticed was misaligned: our model equates a file upload with a new published version. I expect folks will often have several files, or a file + manual edits, or some other combination of input that they want to make before publishing a new version. Essentially, there is a draft version that collects edits until the user is ready to publish. This requires us to track edits/uploads separately from versions (tracking this is also the basis for provenance).
In modeling this, I landed on having two types of things:
- **Inputs**, which represent a specific contribution to the dataset, by a single person, at a single time, from a single source. We can have a single `Items` table that holds all of these and is connected to a `collectionId`.
- **Input Sources**, which represent a specific form-factor of input (e.g. CSV upload, JSON upload, Web UI edit, API request payload, etc). Each `Input` has a foreign key to a single `InputSource`. We can have many types of Input Sources (each their own table) which normalize the data about that type of input source. For example:
  - `InputSourceCSV`
    - `uploadedFileUri`
    - `mapping`
    - `createdAt`
  - `InputSourceAPI`
    - `sourceIpAddress`
    - `payload`
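To make the split concrete, here's a rough sketch of the shapes involved (all field names are assumptions; the type + id pair linking an `Input` to its source is just one way to model the polymorphic foreign key):

```typescript
// Hypothetical sketch of the Input / InputSource split - names are assumptions.
// Every contribution is one Input row; the source-specific details live in a
// normalized per-type table that the Input points into.
interface Input {
  id: string;
  collectionId: string;            // the collection this contribution belongs to
  userId: string;                  // by a single person
  createdAt: Date;                 // at a single time
  inputSourceType: "csv" | "api";  // which InputSource table to look in
  inputSourceId: string;           // FK into that table
}

interface InputSourceCSV {
  id: string;
  uploadedFileUri: string;
  mapping: string; // column-to-schema alignment chosen during upload
  createdAt: Date;
}

interface InputSourceAPI {
  id: string;
  sourceIpAddress: string;
  payload: string;
}
```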
It was helpful for me to model the flow for a CSV upload as guidance:
<img width="1684" alt="Underlay CSV Flow" src="https://user-images.githubusercontent.com/1000455/166744635-c2f6c063-32de-4551-b396-be477898f32b.png">
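In code terms, the pictured flow reduces to something like the sketch below (every helper here is a placeholder name, not an existing function in the repo). The key point is that an upload only records a draft contribution; nothing is published until the user explicitly publishes.

```typescript
// Placeholder helpers - assumed to exist elsewhere in the app.
declare function storeFile(file: File): Promise<string>;
declare function createInputSourceCSV(args: {
  uploadedFileUri: string;
  mapping: string;
}): Promise<{ id: string }>;
declare function createInput(args: {
  collectionId: string;
  userId: string;
  inputSourceType: "csv";
  inputSourceId: string;
}): Promise<{ id: string }>;
declare function processDraft(collectionId: string): Promise<void>;

// Hypothetical sketch of the CSV upload flow pictured above.
async function handleCsvUpload(
  collectionId: string,
  userId: string,
  file: File,
  mapping: string
) {
  // 1. Persist the raw upload and record its form-factor-specific details.
  const uploadedFileUri = await storeFile(file);
  const source = await createInputSourceCSV({ uploadedFileUri, mapping });

  // 2. Record the contribution itself: who, when, and which source it came from.
  const input = await createInput({
    collectionId,
    userId,
    inputSourceType: "csv",
    inputSourceId: source.id,
  });

  // 3. Fold the new Input into the draft data file; publishing remains a
  //    separate, explicit step.
  await processDraft(collectionId);
  return input;
}
```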
## Running task list
As noted above, this is mostly for my notes, not an exhaustive or exact spec-list/roadmap. I'll update it each day based on the progress I've made.
- [x] Clean the Schema edit flow
- [x] Create `Schema` objects on save
- [x] Fix relationship bug
- [x] Default to static viewer with Edit button if there is already a schema
- [x] Build SchemaViewer for static version
- [x] Block editing if there is data - we need to integrate tasl migrations for this to work properly.
- [x] Clean up components
- [x] Update CSS
- [x] Refactor upload popover
- [x] Add nested design to schema-alignment component
- [x] Have 'complete' button generate `Input` and `InputSource` objects. Create the backend space for doing future work.
- [x] Process input into stored data file used for rendering.
- [x] Add reductionType options
- [x] Load data from generated file (e.g. processed draft or version file)
- [x] Allow version switches with dropdown
- [ ] Build structure for making Data queries/selections. It'll all be client-side at the moment, but eventually this will be where API calls go (see the sketch after this list).
- [ ] Visualize Inputs
- [ ] Add button on entities to display provenance viewer (just shows related `Inputs`).
- [x] Build Publish button and flow
- [ ] Build JSON export flow
- [x] Build alignment/selection tool
- [x] Generate cached file and create `Export` object with proper links, etc
- [x] Figure out how to get exports to auto-generate on version update
- [ ] Update Export Table to show real values, etc
- [x] Update CSS
- [ ] Update Overview page to properly hook into versions, schemas, etc
- [ ] Update getting started tab, remove flat schema image
- [x] Make collection slugs have a permanent suffix; this will be useful for routing export caches and in case collection/namespace titles change
- [ ] Add discussions
- [ ] UI for creating
- [ ] Visualize on entity
- [ ] Improve collection preview design
- [x] Make `unique` field singular and rename it to `uniqueIdentifier`
- [ ] Implement settings pages content
- [ ] Improve collection header design
- [ ] Update landing page with improved language
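As referenced in the query/selection task above, here's a minimal sketch of the client-side structure I have in mind (all names are placeholders): the point is a narrow interface whose in-memory implementation can later be swapped for API calls without touching callers.

```typescript
// Hypothetical sketch of a client-side data query layer - names are placeholders.
type Entity = Record<string, unknown>;

interface DataQuery {
  entityType: string;                   // which schema class to select from
  filter?: (entity: Entity) => boolean; // optional predicate on each entity
  limit?: number;                       // optional cap on results
}

// For now this filters an in-memory data file (e.g. the processed draft or a
// version file); later the same signature can be backed by API calls.
function runQuery(data: Record<string, Entity[]>, query: DataQuery): Entity[] {
  const rows = data[query.entityType] ?? [];
  const filtered = query.filter ? rows.filter(query.filter) : rows;
  return query.limit !== undefined ? filtered.slice(0, query.limit) : filtered;
}
```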