paradisec-archive / nabu

nabu is a digital media item management system that provides a catalog of audio and video items, metadata for these items, and information about the workflow status of the items.
GNU General Public License v3.0

Datatype field is missing #205

Closed AmandaPDSC closed 11 years ago

AmandaPDSC commented 11 years ago

The datatype field is missing from items in Nabu - it is a repeating field with a controlled vocab that should be included in each item. It could be placed near "discourse type" but is a separate field

silviapfeiffer commented 11 years ago

The data type is on the individual files in the item. Since most items have more than one file associated with them, I don't think it's possible to give them a data type field. Or am I misunderstanding? What is the controlled vocab that you're referring to?

AmandaPDSC commented 11 years ago

This is a separate field at item level in the old catalogue - the vocab includes historical_text, instrumental_music, language_description, lexicon etc.

silviapfeiffer commented 11 years ago

@LindaBarwick I was under the impression that we wanted to drop that field. That's what I got from early discussions. Can you clarify please?

LindaBarwick commented 11 years ago

I don't recall any previous discussion of this. Obviously the DCMI datatype (sound, moving image etc) does belong at file level. There was some history behind why the DCMI datatype was mixed up with the linguistic datatype @nthieberger may remember.

nthieberger commented 11 years ago

We mentioned this in a message of Jan 12, it is 'linguistic-type' (http://www.language-archives.org/OLAC/1.1/olac-linguistic-type.xsd)

---------- Forwarded message ----------

Hi Silvia and John,

This is info for when you get to the OLAC harvest. OLAC has a limited metadata set that we map our current metadata to using the attached php script. The following are OLAC terms, and we also can use DCMI terms as per the standard: http://www.language-archives.org/OLAC/metadata.html

Their terms, below, are not all useful in our catalog, so we have used those marked with an asterisk. Those marked with a % are mapped to from the term following it in the list below (so, 'lexicography' is one of our 'olac-linguistic-field' terms because it is mapped from 'lexicon' in the data type selection.)

I'm sure we'll need to discuss how this all works.

Thanks,

Nick

anthropological_linguistics applied_linguistics cognitive_science computational_linguistics discourse_analysis forensic_linguistics general_linguistics
%historical_linguistics (mapped from data type = 'historical reconstruction')
history_of_linguistics language_acquisition
% language_documentation (all catalog items are currently assigned this term automatically)
% lexicography (data type = lexicon)
linguistics_and_literature linguistic_theories mathematical_linguistics morphology neurolinguistics philosophy_of_language phonetics phonology pragmatics psycholinguistics semantics sociolinguistics syntax text_and_corpus_linguistics translating_and_interpreting typology writing_systems

*dialogue *drama *formulaic *ludic *oratory *narrative *procedural *report *singing *unintelligible_speech

*language_description *lexicon *primary_text

*annotator *author *compiler *consultant *data_inputter *depositor *developer *editor *illustrator *interpreter *interviewer *participant *performer *photographer *recorder *researcher *research_participant *responder *signer *singer *speaker *sponsor *transcriber *translator

silviapfeiffer commented 11 years ago

So, how do we resolve this?

Right now we have the field "Discourse Type" on each item, which can take these values:

+----+-----------------------+
| id | name                  |
+----+-----------------------+
|  1 | drama                 |
|  2 | formulaic_discourse   |
|  3 | interactive_discourse |
|  4 | language_play         |
|  6 | narrative             |
|  5 | oratory               |
|  7 | procedural_discourse  |
|  8 | report                |
|  9 | singing               |
| 10 | unintelligible_speech |
+----+-----------------------+

That covers what Nick mentions, IMO.

Also, we have roles on each contributor for items, which can take on these values:

+----+---------------+
| id | name          |
+----+---------------+
|  1 | author        |
|  2 | compiler      |
|  3 | consultant    |
|  4 | data_inputter |
|  5 | depositor     |
|  6 | editor        |
|  7 | interviewer   |
|  8 | participant   |
|  9 | performer     |
| 10 | photographer  |
| 11 | recorder      |
| 12 | researcher    |
| 13 | speaker       |
| 14 | translator    |
| 15 | singer        |
+----+---------------+

So, I don't know what is missing.

I did just now notice that the OLAC feed is incomplete, so will have to fix that. I'll open a separate bug for that.

Silvia.

silviapfeiffer commented 11 years ago

On Fri, Sep 21, 2012 at 8:42 AM, Linda Barwick notifications@github.com wrote:

I don't recall any previous discussion of this. Obviously the DCMI datatype (sound, moving image etc) does belong at file level. There was some history behind why the DCMI datatype was mixed up with the linguistic datatype @nthieberger may remember.

At file level we have the data type as a mime type, e.g. image/jpeg, video/mpeg etc. Is this where we should add an additional field that would be provided by hand? All of the other fields are currently imported from the files themselves.

silviapfeiffer commented 11 years ago

Just checking the old system and I can see another field (as Amanda says).

Do we want to restrict the data types to just what Nick lists: *language_description *lexicon *primary_text

Or do we want to import all the values from the old system:

+---------+---------------------------+
| type_id | type_name                 |
+---------+---------------------------+
|       1 | Historical Reconstruction |
|       2 | historical_text           |
|       3 | instrumental_music        |
|       4 | language_description      |
|       5 | lexicon                   |
|       6 | photo                     |
|       7 | primary_text              |
|       8 | song                      |
|       9 | Typological Analysis      |
|      10 | Sound                     |
|      11 | Movingimage               |
+---------+---------------------------+

We can add that to the item.

AmandaPDSC commented 11 years ago

Given that we have records that will use any of the values in that list, we should import all the values from the previous system

LindaBarwick commented 11 years ago

I think we actually need two extra fields. One for the OLAC linguistic data type:

| 4 | language_description |
| 5 | lexicon              |
| 7 | primary_text         |

and another one for the OLAC Linguistic subject vocabulary:

%historical_linguistics (mapped from data type = 'historical reconstruction')
| 1 | Historical Reconstruction |
| 2 | historical_text           |

% language_documentation (all catalog items are currently assigned this term automatically)

% lexicography (data type = lexicon)
| 5 | lexicon |

We also need %typology
| 9 | Typological Analysis |

In other words the contents of that field get dealt with as follows:

|  1 | Historical Reconstruction | > OLAC Linguistic subject: historical_linguistics
|  2 | historical_text           | > OLAC Linguistic subject: historical_linguistics
|  3 | instrumental_music        | IGNORE [there is no OLAC field to map this to, it is a musicological subject field]
|  4 | language_description      | > OLAC linguistic_type: language_description, also > OLAC Linguistic subject: language_documentation
|  5 | lexicon                   | > OLAC linguistic_type: lexicon, also > OLAC Linguistic subject: lexicography
|  6 | photo                     | IGNORE - belongs at file level
|  7 | primary_text              | > OLAC linguistic_type: primary_text
|  8 | song                      | IGNORE [this is covered in a separate field having the OLAC discourse type singing]
|  9 | Typological Analysis      | > OLAC Linguistic subject: typology
| 10 | Sound                     | IGNORE - belongs at file level
| 11 | Movingimage               | IGNORE - belongs at file level

@nthieberger @AmandaPDSC does this make sense?
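
Read as a lookup table, the mapping above might look like the following Python sketch. The names `OLD_DATA_TYPE_TO_OLAC` and `olac_terms_for` are hypothetical, chosen only for illustration, and are not taken from the Nabu code base; the values are copied from the list above.

```python
# Hypothetical lookup table built from the mapping proposed above.
# Keys are type_name values from the old data_type table. Each value is a
# pair: (OLAC linguistic types, OLAC linguistic subjects). Two empty lists
# mean the value is ignored in the OLAC export.
OLD_DATA_TYPE_TO_OLAC = {
    "Historical Reconstruction": ([], ["historical_linguistics"]),
    "historical_text": ([], ["historical_linguistics"]),
    "instrumental_music": ([], []),
    "language_description": (["language_description"], ["language_documentation"]),
    "lexicon": (["lexicon"], ["lexicography"]),
    "photo": ([], []),
    "primary_text": (["primary_text"], []),
    "song": ([], []),
    "Typological Analysis": ([], ["typology"]),
    "Sound": ([], []),
    "Movingimage": ([], []),
}


def olac_terms_for(data_types):
    """Collect OLAC linguistic types and subjects for an item's data types."""
    linguistic_types, subjects = [], []
    for name in data_types:
        # Unknown names are treated like ignored values.
        types, subs = OLD_DATA_TYPE_TO_OLAC.get(name, ([], []))
        for term in types:
            if term not in linguistic_types:
                linguistic_types.append(term)
        for term in subs:
            if term not in subjects:
                subjects.append(term)
    return {"linguistic_type": linguistic_types, "linguistic_subject": subjects}
```

Keeping both target vocabularies in one table means the item can still carry a single field, with the split into two OLAC elements happening only at export time.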

LindaBarwick commented 11 years ago

Regarding

| 8 | song | IGNORE [this is covered in a separate field having the OLAC discourse type "singing"]

I just tried to update in old catalog so that all 540 items with "song" in their data type field would also have the discourse type "singing" but I discovered that the discourse type field is missing in the "update items" tab of the old catalog.

@silviapfeiffer how do you think we could deal with this? We could temporarily leave "song" as an OLAC linguistic data type for import, then after going live do a bulk update to match all these items with discourse type "singing", and later get you to delete "song" from the OLAC linguistic data type table. Or we could prepare a spreadsheet for you to import with the columns item ID | discourse type | data type. Or else we could edit the update area of the current catalog to make the discourse type field available for bulk updates there.

nthieberger commented 11 years ago

Yes, this looks good. I'm sorry that I missed the fact of this all being left out until this point of the development

silviapfeiffer commented 11 years ago

I'll try and fix this by next week, so we can actually move over. This is definitely a blocker.

@LindaBarwick I can create the "singing"/"song" fix upon import.
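
One way such an import-time fix could work, sketched in Python rather than in Nabu's own code: every imported item that carries the data type "song" also receives the discourse type "singing", and "song" is optionally dropped from its data types. The item dictionaries, their keys, and the function name `apply_song_fix` are assumptions made for this illustration.

```python
def apply_song_fix(item, drop_song=True):
    """Return a copy of an item record with "song" translated to "singing".

    `item` is a hypothetical dict with "data_types" and "discourse_types"
    lists; the input record is not modified.
    """
    data_types = list(item.get("data_types", []))
    discourse_types = list(item.get("discourse_types", []))
    if "song" in data_types:
        # Add the discourse type only once, even if it is already present.
        if "singing" not in discourse_types:
            discourse_types.append("singing")
        if drop_song:
            data_types = [t for t in data_types if t != "song"]
    return {**item, "data_types": data_types, "discourse_types": discourse_types}
```

Passing `drop_song=False` would correspond to leaving "song" in place for the import and deleting it in a later pass.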

LindaBarwick commented 11 years ago

ok please let me know if you need more from me.

silviapfeiffer commented 11 years ago

@LindaBarwick Can we just continue to have all these values in one table and one field in the interface? I wouldn't want to include OLAC knowledge into the user interface. The mapping to OLAC fields will only be done on the backend when the OLAC feed is created (just as it happens in the old system).

silviapfeiffer commented 11 years ago

Also, can I rename the table to Data Category? I don't like "Data Type" - it's too generic and frequently mixed up with the data type of files (image/video/audio).

LindaBarwick commented 11 years ago

One belongs in DC:subject and one in DC:type, and I think ExSite9 may separate them.

I think we need to hear from @jangari and @nthieberger on this one

silviapfeiffer commented 11 years ago

It's ok that they go into different places in the OLAC feed, and even come in through different fields in the ExSite9 feed. But since there are only a small number of values, it seems to be overkill to do more than what the old system did.

silviapfeiffer commented 11 years ago

Also, we have a problem if we create two tables with the contents that you propose: where do we put the other fields that are not exported to OLAC, in particular instrumental_music, photo, song, sound, MovingImage ?

LindaBarwick commented 11 years ago

I have suggested that we just ignore those values.

Linda

silviapfeiffer commented 11 years ago

Ignore them in nabu completely or just when exporting to OLAC?

LindaBarwick commented 11 years ago

They can be ignored in Nabu completely, I think. I've added the 'instrumental music' to the descriptions of all those items, so it's redundant. And the other values (photo, sound, MovingImage) do not belong at item level.

silviapfeiffer commented 11 years ago

Ah right. That changes things. I'm going to wait for confirmation tomorrow if Nick and Aidan also think that's the way to go. I've right now implemented just a full import of the state of the old data_type table.

nthieberger commented 11 years ago

Not sure if I follow completely. It is fine to have the elements listed above in one drop down with multiple possible selections and to sort out export differences later. We do need 'photo, sound, MovingImage' at the item level as well, I'm not sure why they would only be at collection level Linda? Maybe we can talk about this on Monday to clarify?

silviapfeiffer commented 11 years ago

The discussion only refers to item level.

I think Linda is suggesting that we introduce 2 tables as a replacement for the one that used to be data_type.

One called: linguistic data type with the choice of

and a second one called: Linguistic subject vocabulary with the choice of

She has provided a mapping for how to take the existing values to these.

silviapfeiffer commented 11 years ago

Also, she is suggesting to drop:

  • photo
  • sound
  • Moving Image

since the data type of the file already includes this information.

Finally:

  • instrumental_music
  • song

will be part of discourse_type and not needed here.

silviapfeiffer commented 11 years ago

I have for now just imported the data from the old system as is.

nthieberger commented 11 years ago

I think we need to be explicit about 'movingimage' as an element as it then can turn up in OLAC export (e.g. in http://www.language-archives.org/item/oai:paradisec.org.au:NT5-StringBand). Are you saying it can be generated because the file type is video and so does not need to be provided in the textual metadata? And the same for 'photo' and for 'sound'? But 'photo' is not predictable (given that a pdf could be a photo for example).
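
To make the question concrete, here is a sketch of deriving a DCMI type from a file's MIME type. The function `dcmi_type_from_mime` is hypothetical and its mapping rules are an assumption for illustration, not how Nabu or the old catalog works. It returns `None` where the MIME type alone does not settle the matter, as with a PDF that might hold a photo.

```python
def dcmi_type_from_mime(mime_type):
    """Guess a DCMI Type term from a MIME type, or None if it is ambiguous."""
    major = mime_type.split("/", 1)[0].lower()
    by_major = {
        "audio": "Sound",
        "video": "MovingImage",
        "image": "StillImage",
        "text": "Text",
    }
    # application/pdf, application/xml and similar types could hold text,
    # a scanned photo, or something else, so they are left for a person
    # to decide and come back as None.
    return by_major.get(major)
```

Under these assumptions 'movingimage' and 'sound' can be generated from the files, while anything that comes back as `None` would still need an explicit value in the textual metadata.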

On 23 September 2012 19:54, Silvia Pfeiffer notifications@github.com wrote:

Also, she is suggesting to drop:

  • photo
  • sound
  • Moving Image

since the data type of the file already includes this information.

Finally:

  • instrumental_music
  • song

will be part of discourse_type and not needed here.


LindaBarwick commented 11 years ago

Not at collection or item level, at file level

LindaBarwick commented 11 years ago

Our items can include files of multiple types, e.g. sound, moving image, text, XML, so this info properly belongs at file level. Unless you want item level to default to the main data type?

We would need to add another field for this DCMI type.

silviapfeiffer commented 11 years ago

We don't want to have to manually add information at the file level. That requires an additional level of data entry that is right now fully automated. So, let's stick with it at item level.

At this stage - unless we really want to delay the rollout of Nabu further - I suggest we just go with the same approach that we had in the old system. Since I have already implemented that, I'm going to close this now.

If you have a discussion and come to a different conclusion and want this worked on further, please re-open.

I'm planning to push out the latest state later tonight, so you should be able to test tomorrow.

silviapfeiffer commented 11 years ago

Oops, update will be pushed by this arvo.

AmandaPDSC commented 11 years ago

Can I be pedantic and suggest that all items in the list should be lower case - at the moment Sound, Movingimage and Historical reconstruction begin uppercase but all others are lower case. Also there is a space in Historical Reconstruction but not in Movingimage and all other two-word items use _ between the words. @nthieberger is there any good reason for this?

nthieberger commented 11 years ago

You are right, they should be lowercase, and the space has to be filled by an underscore. Unfortunately, even though we are talking about standards, Dublin Core has 'MovingImage' (http://dublincore.org/documents/dcmi-terms/) so I am not sure if we need to keep that or can use 'moving_image'?

silviapfeiffer commented 11 years ago

I didn't like that either, but that's how I imported it from the old DB. How about making it readable in our interface and then just writing the right values to the XML files and harvesting interfaces? I can fix the import for that.
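
A rough sketch of that idea in Python: keep one stored value per entry, derive a readable label for the interface, and look up the term written to the feeds separately. The function names and the `EXPORT_OVERRIDES` table are made up for this example; only 'MovingImage' as the Dublin Core spelling comes from the discussion above.

```python
# Hypothetical overrides for terms whose exported spelling is fixed by a
# standard; 'MovingImage' is the Dublin Core form mentioned above.
EXPORT_OVERRIDES = {
    "movingimage": "MovingImage",
    "moving_image": "MovingImage",
}


def normalise(name):
    """Lower-case a stored name and join its words with underscores."""
    return "_".join(name.strip().lower().split())


def display_label(name):
    """Readable label for the interface, e.g. 'Historical reconstruction'."""
    return normalise(name).replace("_", " ").capitalize()


def export_value(name):
    """Value to write to the XML files and harvesting interfaces."""
    key = normalise(name)
    return EXPORT_OVERRIDES.get(key, key)
```

With this split, the import can store whatever the old database held, and the casing question only has to be answered once, in the override table.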