sslarch / caa2020_hackathon

Repository for a CAA2020 session

Tool to help document archaeological data #3

Open · zackbatist opened this issue 5 years ago

zackbatist commented 5 years ago

An interface that prompts users to document various aspects of their datasets and to highlight or explain the implicit relationships between tables or variables. May be especially valuable for organizing series of scattered spreadsheets. A rough sketch of the kind of metadata this could collect follows the list below.

  1. User selects a series of spreadsheets, MS Access or SQL databases, PDFs of blank physical recording sheets, images of blank physical tags, directory structures, etc.
    • For physical recording sheets and tags, user selects and highlights the fields that are represented on those media, which are then included in subsequent stages
  2. Prompt to delimit scope
    • Written explanation, e.g. "this documentation explains the extensive portion of our database that deals with lithics processing"
    • Select which tables or components to include
    • The scope of file directories and what they are meant to contain (e.g. /trench001/ contains all files pertaining to trench 001, including the trench report and section drawings; /trench001/context0001 contains all info pertaining to context 0001, including a folder for pics, special finds from within context 0001, etc.)
    • Notes regarding the project as a whole, why the data is being collected, what kinds of work can or will be done with it, etc.
  3. Prompt to document each table
    • Why does it exist? What is it meant to include?
    • Who contributes to it? Provide names and contact info
    • How are these tables populated? e.g. web forms, physical recording sheets copied over, API access, etc.
  4. Prompt to document the variables for each table
    • Identify and explain the composition of indexes, why indexes require or do not require unique values, etc
    • Identify and explain the reasons behind each relationship between table indexes (candidate relationships could be suggested by fuzzy matching against similarly named variables elsewhere)
    • Identify implicit groupings among dependent variables, e.g. if different survey point collection methods have different variables associated with them (dog leash samples are associated with values in the radius variable, whereas grab samples are not)
    • If values are selected from a preset list, what does each value in the list represent?
  5. Contact info of key personnel in charge of managing the project and its data
  6. Generate fancy visualizations and reports
    • Colour-coded variable groupings
    • Flow charts representing the drawn-out processes through which data is filled into various tables
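
To make this concrete, here is a rough sketch of how the collected metadata might be structured, assuming a Python implementation. All class and field names are hypothetical, just to illustrate the shape of the thing:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema for the metadata the tool would collect.
# None of these names are a fixed standard; they mirror the steps above.

@dataclass
class Contact:
    name: str
    email: str
    role: str  # e.g. "project director", "database manager"

@dataclass
class Variable:
    name: str
    description: str = ""
    is_index: bool = False
    relates_to: Optional[str] = None  # index in another table, if any
    value_meanings: dict[str, str] = field(default_factory=dict)  # preset lists

@dataclass
class Table:
    name: str
    purpose: str            # why it exists, what it is meant to include
    population_method: str  # e.g. "web form", "copied from recording sheets"
    contributors: list[Contact] = field(default_factory=list)
    variables: list[Variable] = field(default_factory=list)

@dataclass
class DatasetDocumentation:
    scope: str          # written explanation of what this documentation covers
    project_notes: str  # why the data is collected, intended uses
    tables: list[Table] = field(default_factory=list)
    key_personnel: list[Contact] = field(default_factory=list)
```

Something like this could be serialized to JSON or YAML so the result stays both human- and machine-readable.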
nevrome commented 5 years ago

Thank you for this interesting idea @zackbatist. The underlying problem of meta-documentation is IMHO very real and important.

I do have difficulties, though, seeing how the interface of this software would work. Each archaeological project produces very different kinds of data, and each archaeologist uses their own personal workflow to manage it. Where would these prompts appear, and how could this metadata be stored permanently in a human- and machine-readable way?

I wonder whether not a new piece of software but a well-described workflow -- a how-to guide for producing metadata -- would be a better and more universal solution. This would be very compatible with the aims of this session.

I have a feeling that this proposal could be interesting for you, @florianthiery.

zackbatist commented 5 years ago

I had a chance to think about this some more. I envision this as a sort of post-hoc documentation tool, partially inspired by tools that help generate data management plans, such as DMPonline and Portage, which are meant to prompt researchers to think about how they will handle data before they begin their work (though they are laughably ineffective because they are unenforceable and lack sufficient specificity).

I think this would be most useful for projects that keep data stored across a series of vaguely-named Excel spreadsheets, which is still quite common practice. Links between spreadsheets are inherently implicit, and this would simply force the user to document them in a more explicit way.[1] This would make it easier to do manual queries, especially for people who are reusing shared data that they were not involved in creating. The goal is therefore human readability.

A user would identify a series of Excel files to be read by the system. The system doesn't care what the files' contents are. The user would be prompted to explain the significance or scope of this group of files, which could represent a relevant subset of data from a bigger project (e.g. lithics data from archaeological project X). The script would read the name of each file and prompt the user to explain or describe its scope. If there are multiple worksheets, the user would be prompted to describe each one. Users would also be able to provide contact info for the people who created or maintained each file.
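
In rough pseudo-Python terms (assuming openpyxl for reading the files and plain console prompts; the file pattern and the `docs` structure are just placeholders), this first pass might look like:

```python
import glob

from openpyxl import load_workbook

# Hypothetical first pass: enumerate the user-selected files and their
# worksheets, and prompt the user to describe each one.
docs = {}
for path in glob.glob("project_data/*.xlsx"):  # stand-in for user selection
    wb = load_workbook(path, read_only=True)
    docs[path] = {
        "scope": input(f"What does '{path}' cover? "),
        "maintainer": input(f"Who created or maintains '{path}'? "),
        "sheets": {sheet: input(f"Describe worksheet '{sheet}': ")
                   for sheet in wb.sheetnames},
    }
```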

Then the script would dive deeper into each worksheet by parsing the values stored within it. Column names would be identified with user assistance (e.g. "Is the first row column names? y/n"). The values under each column would then be read, and if there seem to be repeated or standardized values (shorthand, abbreviations, etc.) the system would prompt for their meanings to be defined more clearly. Index columns (i.e. independent variables) would be identified, and their relations to other indexes in other input spreadsheets would be declared and explained in a subsequent stage. I imagine that Excel formulas might also be parsed somehow.
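
As a sketch of that second pass (assuming pandas; the threshold of 15 distinct values is an arbitrary illustrative heuristic, and `path`/`sheet` stand in for values carried over from the first pass):

```python
import pandas as pd

path, sheet = "project_data/lithics.xlsx", "Sheet1"  # hypothetical example

# Ask whether the first row holds column names, then read accordingly.
header = 0 if input("Is the first row column names? y/n ") == "y" else None
df = pd.read_excel(path, sheet_name=sheet, header=header)

# Flag columns whose values look like a preset list or shorthand:
# few distinct values, each repeated several times on average.
codebook = {}
for col in df.columns:
    values = df[col].dropna()
    if 0 < values.nunique() <= 15 and len(values) > 2 * values.nunique():
        for v in values.unique():
            codebook.setdefault(col, {})[v] = input(
                f"In column '{col}', what does '{v}' mean? ")
```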

The result would be a brief and professional-looking report. Much of this can be done using base R, dplyr, and whatever Excel-parsing package is most extensive or up to date, all wrapped up as a Shiny app. But because the imagined user would be someone who hasn't bothered to explicitly relate their data, stacking this on R (which requires installing and launching R first, if run locally) might be the wrong path to take. It would be better to code this in Python and then wrap it as an application bundle using py2app or whatever the equivalent may be for Windows systems.
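
The report step itself could be as simple as dumping the answers into Markdown; again just a sketch, reusing the hypothetical `docs` dictionary from the first-pass sketch above:

```python
# Hypothetical final step: render the collected answers as a plain
# Markdown report that can travel alongside the spreadsheets.
with open("data_documentation.md", "w") as report:
    report.write("# Data documentation\n\n")
    for path, info in docs.items():
        report.write(f"## {path}\n\n{info['scope']}\n\n")
        report.write(f"Maintained by: {info['maintainer']}\n\n")
        for sheet, desc in info["sheets"].items():
            report.write(f"### {sheet}\n\n{desc}\n\n")
```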

[1] However, it would also be very useful for documenting SQL databases, since the reasons for making various database design decisions are rarely documented, at least in my experience. To keep things simple, it's best to focus on the scattered-spreadsheets scenario to start with.

MartinHinz commented 5 years ago

I think this would be a very valuable project that goes hand in hand with other efforts to create a more standardized workflow for processing (archaeological) data. It would certainly be very useful and would integrate well into an overall analysis tool for standardized evaluations, e.g. in the sense of SDS processing as already implemented by Clemens, or as a revival of a dormant project such as quantAAR.

However, I am not sure whether such an extensive project could be implemented within a hackathon at the CAA, especially since I think some conceptual groundwork would have to be done first in order to create a meaningful and sustainable interface.

But I would be very pleased if we took this up as a general project idea and, inshallah, started implementing it as soon as there is time for that.

nevrome commented 5 years ago

@zackbatist Thanks for this explanation. I understand it better now and agree with @MartinHinz that this might be very useful if implemented in a good way. :+1:

I'm not sure, though, whether it makes sense at this point to distinguish between projects that are sufficiently simple or too complex for the session. Pretty much all of the ideas we have so far can't be fully realized in the handful of hours we have together. I see this session more as a kickstarting event to discuss and develop first prototypes and to gather collaborators for future work.