Input and output data types for SoftwareApplication

martin-nc commented 7 years ago

I'm trying to map the bio.tools schema to the SoftwareApplication type. Bio.tools is a registry of life science tools and services, and its schema maps fairly well to SoftwareApplication. There is one property in the bio.tools schema that doesn't map, though, and that is data type (for both input and output).

By data type, bio.tools does not mean 'data format' (XML, JSON, CSV, etc.) but descriptive keywords like 'ontology', 'protein alignment' or 'protein sequence'. For example, SignalP has an input of type 'sequence' and an output of type 'protein features'.

I don't see a property in SoftwareApplication to express data types in this sense. Is it worth adding input data and output data types to SoftwareApplication? Or is there another approach I could use?

thadguidry commented 7 years ago

Deal with the Data as content... the data can be thought of as a CreativeWork. The SoftwareApplication is a different Type from the Data that it consumes, transforms or produces.

Treat the input/output as a Dataset perhaps as well.

You can use http://schema.org/keywords on your Dataset.

You can also always use http://schema.org/additionalType and point to a URL that already has a good Type or better suited type for your Data.

martin-nc commented 7 years ago

@thadguidry thanks for that. So for my above example could you do something like:

<script type="application/ld+json">
{
 "@context": "http://schema.org",
 "@type": "SoftwareApplication",
 "name": "SignalP",
 ...
 "additionalType": [
     {
       "@type": "Dataset",
       "keywords": ["sequence", "protein features"]
     }],
 ...
 }
</script>

The only snag with this is that it's not clear which keyword describes the input and which describes the output. Using about would have the same problem. Ideally there'd be a property to describe a 'before' and 'after' state.

thadguidry commented 7 years ago

@martin-nc The SoftwareApplication "SignalP" is not additionally a Dataset, but uses one or consumes one. (but perhaps your example was not wired up that way intentionally).

Yeah, so since you're really trying to capture some Input and some Output process and then label that Input and that Output... you'll need to find a Type that you can use, or begin forming a proposal for one. I/O, IO, InputOutput... are typically needed and used by SoftwareApplications. I'm thinking that you are looking for a new Type, perhaps called "IO" or "Process" or similar, that would have various properties under it, such as the input and output you need, which would in turn expect a Type of Input and Output.

I don't want to be the one here that's framing this up however, because I don't have direct needs, but you do... so give it some more thought and put on paper what you think ideally you'd like to see and then gather feedback here.

UPDATE: Looking at bio.tools... I do see that it has the idea of an EDAM operation, and that's probably where my above analysis comes into play: an EDAM operation (function) processes input data and produces output data. http://bioportal.bioontology.org/ontologies/EDAM?p=classes&conceptid=operation_0004

martin-nc commented 7 years ago

@thadguidry Yes, I'll have a look for a Type that has the concept of 'before' and 'after', or 'input' and 'output'. If not I'll put together a proposal - I'm thinking it could be abstract and apply to any situation where there is a process that changes information or material from one state to another.

Edit: Perhaps ControlAction provides one approach:

<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "SoftwareApplication",
"name": "SignalP",
...
"additionalType": {
   "@type": "ControlAction",
   "object": {
    "@type": "Thing",
    "name": "sequence"
  },
   "result": {
    "@type": "Thing",
    "name": "protein features"
  }
},
...
}
</script>

Again, I'm not sure if I've used additionalType correctly here. Perhaps SoftwareApplication could have an 'operation' property that is a ControlAction? Then we have the concept of an object being acted upon, and a result.

thadguidry commented 7 years ago

@martin-nc Yes, you have the right idea... you would want a property on SoftwareApplication to expand and extend into the IO parts. That might be a new property called 'operation' or some such. I'll leave it up to you. Play around and think it through more. Pull in a use case from outside your own Bio domain, like Web Development with APIs, or describing programs like Pentaho or Talend. Those are ETL software applications that I use daily; they can process lots of things and typically have multiple IO going on to databases, APIs, files, you name it. In fact, I do LABEL each of my IO processes on both the Input side and the Output side. Example: "input from customerXYZ.dataset.Sequences Oracle 12c" "output to SOAPInterface dataset.ABC.transformedSequences".

Side notes that don't directly apply here, but can give you food for thought:

1. SMO, the Software Measurement Ontology from ~2009, can also help you later if you need to dive into Measures as part of the domain of Software itself: https://pdfs.semanticscholar.org/adce/8d4976712f21d593d56bee787c77f7e460f2.pdf
2. As I did some research for you, I remembered what some of my scientist friends struggle with in measurements. For instance, Salinity is considered a Measure, but can also be considered a Measurable Concept or, in other ontologies, a Measurable Entity. That's because Salinity is a loose contextual term (what was really measured, and how, completely differs between researchers and over time... it's not a hard measurement term, but actually a Concept). With a newer standard like TEOS-10, introduced in 2010, it has become a bit less loose for scientists and is maturing into a hard measurement term, i.e., "Salinity as defined by TEOS-10" versus just "Salinity".

I need to reference Note 2 above for the QUDT discussion issue #1390

martin-nc commented 7 years ago

@thadguidry Thanks for that. I wonder if you could just embed ControlAction into other types to express transformation, where the type it is embedded in is the agent (with an object and result). For example, you could have a 'Government Organization' that is a recycling plant and embed ControlAction where the object is 'Plastics' and result is 'textiles' i.e. the plant converts plastics to textile. In your scientific context then some experiments will convert one substance or quantity into another, so ControlAction could be useful there.

On the other hand, looking at the schemas list I don't think that many would involve transformations e.g. books and events don't tend to transform things. So perhaps it's best to focus on SoftwareApplication. The examples here will most likely be non-commercial software - in particular command line tools that perform an analysis.

I'd guess that commercial software or software with GUIs tend to be more complex and don't have an input and output that is easy to define. Adobe Camera Raw, for example, converts RAW image files to other file formats (jpegs etc), but does much more than that. The sorts of cases I'm thinking about would be smaller programs with fewer features. As a very simple example, you could have a macro in VBA that converted a suitably formatted Word file to HTML. Another graphics example would be 2Jpeg. The bio.tools registry contains examples for the life sciences (e.g. Kaiju). So my suggestion would be to have a new property 'operation' for SoftwareApplication that expects a type ControlAction.

thadguidry commented 7 years ago

@martin-nc However, I don't share the need nor do I see a clear use case here in the discussion for having the 'operation' property on SoftwareApplication. I think the best you can do here is to have those 'labels' or 'categories' on the Input and Output side. Currently, Schema.org does not have a way for you to say "process data" on the SoftwareApplication, because there typically isn't a need there for searchability. But if you have a bunch of smaller programs and you want to describe some aspect or features or categories that they deal with...then the best way I think would be to just use:

http://schema.org/featureList
http://schema.org/applicationCategory
http://schema.org/applicationSubCategory
http://schema.org/about

Bear in mind that these properties apply only at the level of the SoftwareApplication type; you cannot attach them to an individual Input or Output. I think that's good enough and still helps discovery of your SoftwareApplications. Any further schema diving gets you into a very domain-specific metadata sharing need: describing the IOs. That discussion is already happening and surfacing in our IoT development discussion: https://groups.google.com/forum/#!forum/sdo-iot-sync Feel free to participate there as well.
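As a rough illustration of that application-level approach, here is a minimal Python sketch that builds such markup as a dictionary and prints it as JSON-LD. Only the property names come from the links above; every value is a hypothetical placeholder and not taken from bio.tools.

```python
import json

# All values below are hypothetical placeholders; only the property
# names (applicationCategory, applicationSubCategory, featureList,
# about) come from the schema.org links above.
signalp = {
    "@context": "http://schema.org",
    "@type": "SoftwareApplication",
    "name": "SignalP",
    "applicationCategory": "Bioinformatics",
    "applicationSubCategory": "Sequence analysis",
    "featureList": ["Accepts sequence input", "Reports protein features"],
    "about": {"@type": "Thing", "name": "protein features"},
}

# Serialize to JSON-LD text that could be placed in a script element.
print(json.dumps(signalp, indent=2))
```

Everything here describes the application as a whole, so nothing in the markup says which term belongs to the input and which to the output.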

martin-nc commented 7 years ago

@thadguidry Thanks again for your advice and help here. We're already using some of the properties you mention (see the draft Bioschemas Tools specification). I'm thinking the solution might be this:

<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "SoftwareApplication",
"name": "SignalP",
"potentialAction": {
   "@type": "ControlAction",
   "object": {
      "@type": "Dataset",
      "about": "Input",
      "keywords": "Sequence"
   },
   "result": {
      "@type": "Dataset",
      "about": "Output",
      "keywords": "Protein features"
  }
},
...
}
</script>

It validates in the structured data testing tool, if that means anything!
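A minimal sketch of how markup of this shape could be generated for many tools, assuming each tool has one input keyword and one output keyword. The helper names `io_action` and `to_script_tag` are hypothetical, and the sketch leaves out the `about` labels, relying on `object` and `result` to mark the input and the output.

```python
import json


def io_action(input_keyword, output_keyword):
    """Build a ControlAction whose object is the input and whose result is the output."""
    return {
        "@type": "ControlAction",
        "object": {"@type": "Dataset", "keywords": input_keyword},
        "result": {"@type": "Dataset", "keywords": output_keyword},
    }


def to_script_tag(markup):
    """Serialize markup as a JSON-LD script element, like the snippets in this thread."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


signalp = {
    "@context": "http://schema.org",
    "@type": "SoftwareApplication",
    "name": "SignalP",
    "potentialAction": io_action("Sequence", "Protein features"),
}

print(to_script_tag(signalp))
```

Passing a testing tool only shows that the markup is well formed; it does not show that consumers will read `object` and `result` as input and output.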

joncison commented 7 years ago

Am I right in thinking that the value of @type can be any string, so e.g. the following would be OK:

"@type" : "http://edamontology.org/data_2044" ?

martin-nc commented 7 years ago

@joncison I don't think so. In the schema.org context, the @type is to embed another schema.org type from this list. If you wanted to use the url, and simplify the example in my previous post:

<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "SoftwareApplication",
"potentialAction": {
   "@type": "ControlAction",
   "object": {
      "@type": "Dataset",
      "about": "http://edamontology.org/data_2044"
   },
   "result": {
      "@type": "Dataset",
      "about": "http://edamontology.org/data_2544"
  }
},
...
}
</script>

i.e. I didn't need to have used about for input and output in my previous example - object and result are enough. I made the second url up, but you get the gist!

joncison commented 7 years ago

hmm, but then don't we lose the link to input and output, or is this implied by object and result ?

Could the URLs go in keywords instead? Seems the right place.

Sorry if I'm asking daft questions which I might not have needed to ask if I'd have read the thread more closely :-/

thadguidry commented 7 years ago

@martin-nc @joncison No, no... please use what we already provide! https://schema.org/additionalType (If you really want, you can use schema:about, but that says some Thing is about some Thing, not that it is a Type of Thing... different case there.)

"result": {
      "@type": "Dataset",
      "additionalType": "http://edamontology.org/data_2544"
}

martin-nc commented 7 years ago

@thadguidry @joncison Ah, okay. I was thinking of about because then people who weren't using an ontology could just use text rather than a url. I don't think that's going to be a common use case, though, so it sounds like additionalType is the way to go. And as you say 'about' is not the same as 'is'.
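Putting the pieces together, the sketch below shows one way a consumer might read the input and output types back out of such markup, assuming the input sits under `object` and the output under `result`. The function name `io_types` is hypothetical, and the second EDAM URL is the made-up one from earlier in the thread.

```python
def io_types(markup):
    """Return (input_type, output_type) read from potentialAction.object and .result."""
    action = markup["potentialAction"]
    return (
        action["object"].get("additionalType"),
        action["result"].get("additionalType"),
    )


signalp = {
    "@context": "http://schema.org",
    "@type": "SoftwareApplication",
    "name": "SignalP",
    "potentialAction": {
        "@type": "ControlAction",
        "object": {
            "@type": "Dataset",
            "additionalType": "http://edamontology.org/data_2044",
        },
        "result": {
            "@type": "Dataset",
            # Placeholder URL: the thread notes this one was made up.
            "additionalType": "http://edamontology.org/data_2544",
        },
    },
}

input_type, output_type = io_types(signalp)
print("input:", input_type)
print("output:", output_type)
```

The position in the action carries the input/output distinction, so no extra label is needed on either Dataset.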

thadguidry commented 7 years ago

@martin-nc So can you close this issue? You good now, you think?

martin-nc commented 7 years ago

@thadguidry Yep, I'll do that now. Thanks for your help!

schemaorg/schemaorg issue #1431