openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

BOLSTM hackathon #32

Closed AndreLamurias closed 6 years ago

AndreLamurias commented 6 years ago

Although we already got some clarification (#26) it would be great to have some assistance in testing our component as part of an application.

greenwoodma commented 6 years ago

To make sure we are as well prepared as possible to help during the hackathon sessions could you please add/attach to this issue:

  1. The landing page URL of any component/workflow you have registered
  2. The OMTD-SHARE XML file for each component/workflow
  3. One or two sample documents that you expect to produce sensible output for your component/workflow
AndreLamurias commented 6 years ago
  1. Is this the XML that I can download from the "My components" page -> "Download component as XML"?
  2. Should the sample documents be in the XMI format? I'm trying to write a parser for our component but I can't find any examples of how to represent named entities in that format (our component requires documents annotated with chemical entities)
greenwoodma commented 6 years ago
  1. Yes, although I assume you probably generated that XML before uploading your component -- it's essentially the same file. If your component is a GATE or UIMA resource, then chances are the XML file is inside the jar you produced (assuming you followed the guidelines). For Docker things are a bit different, and you might just have used the form rather than providing raw XML.
  2. Documents in a format your component expects would be best. If you are currently accepting plain text or some other format, then for now use that, and we can at least test that your component works on the platform and produces the output you expect, even if it can't be chained with other components into a larger workflow yet.
AndreLamurias commented 6 years ago

To be clear, it is a Docker component, and I simply uploaded the XML generated with the form. Please let me know if any changes are necessary.

Landing page: https://test.openminted.eu/landingPage/component/e4b49cd0-36af-4f95-a98f-8bf1100b8607
OMTD-SHARE XML: e4b49cd0-36af-4f95-a98f-8bf1100b8607.zip
Documents: sample_corpus.zip

We also want to register another component to train models but I think it would be better to have this one implemented properly first.

pennyl67 commented 6 years ago

@AndreLamurias With regard to the XML metadata, the following elements need some further checking: (1) some clarification on our part: the inputContentResourceInfo is about the documents that are going to be processed; so

gkirtzou commented 6 years ago

@AndreLamurias 1) In distributionInfo, in the distributionLocation element, also add the version (tag) of your image, for sustainability and reproducibility reasons. Also make sure that the command (your omtd executor) can be run from anywhere inside the docker image. See also: https://github.com/openminted/Open-Call-Discussions/issues/28#issuecomment-381683553

2) Optional

AndreLamurias commented 6 years ago

updated metadata file: e4b49cd0-36af-4f95-a98f-8bf1100b8607.zip

@pennyl67 (1) No, the actual format is the XML used by the DDI corpus; XMI was left in by mistake (and also because I was trying to implement that parser). And yes, the documents must already be annotated with terms. (2) Yes, it is a TSV table; I've updated the file. (3) I added that element with the same subelements as metadataCreator, is that ok?

@gkirtzou (1) Just to confirm, the command is a shell script in the root directory of the image. I can run it with docker run -v /corpus/:/corpus_in -v /output/:/output -it bolstm:omtd /predict.sh --param:model=full_model --input=/corpus_in/ --output=/output/ -- is this ok? The output is saved in a TSV file in /output. (2) I have this element: "keep</ns0:previousAnnotationTypesPolicy>" -- what else should I add? I couldn't find anything here: https://guidelines.openminted.eu/previousannotationtypespolicy.html

Should I also include the metadata file in the docker image, or is that required only for GATE or UIMA based components? Also, since I edited the file manually this time, can I update the component page by uploading the file, or do I have to re-enter the changes through the form?

gkirtzou commented 6 years ago

@AndreLamurias (1) If the predict.sh script lies in the root folder and you provide its full path in the distributionInfo.componentDistributionInfo.command element, it should work. The OMTD platform generates a command similar to the one you describe, taking into consideration that the dockers run within our infrastructure. Thus we only ask which script is your omtd executor.

(2) You have declared your software as a component, so it will be used within OMTD workflows to create applications. That's the reason for requesting the previousAnnotationTypesPolicy element. If your component keeps the term annotations in the output, then keep is the correct choice. If you keep some and remove some, then modify is the correct choice; if you generate entirely new ones, drop should be used.

(3) You don't need to include the metadata file in the docker image.

(4) Since your component is declared as private, you can edit the existing metadata via the menu under your name in the top left corner -> My components -> select the pencil "Edit component metadata". I am attaching your metadata record, as I made a small correction to make it valid XML (the placement of CreatorInfo was incorrect): bolstm_v2.xml.zip

(5) A quick question: is omtd your tag/version in distributionInfo.componentDistributionInfo.distributionLocation?

(6) If your input needs to be in the XML format used by the DDI corpus, I would suggest also adding in inputContentResourceInfo.dataFormats.dataFormatInfo the element dataFormatOther, which is free text declaring this.

I think the rest is ok. I will check again to see if I missed anything.

AndreLamurias commented 6 years ago

@gkirtzou (2) ok so from your explanation I think it would make more sense to use "drop" since the output file does not include the input entity annotations.

(4) I was asking if there was a way to upload the XML since some fields, for example resourceCreator, cannot be edited using the form (at least I couldn't find the respective field).

(5) yes, I created that tag yesterday as I was making a few adjustments to the image.

Thanks for the help! I have updated the metadata using the form and in the file: bolstm_v3.zip

gkirtzou commented 6 years ago

(4) Unfortunately, the registration form does not support the full metadata schema. You can register using the XML instead: go to Add -> Component -> Dockerized Component -> Upload XML -> Upload.

I made a small correction to the OMTD-SHARE description you sent me (a minor typo that made it fail validation against the schema).

bolstm_v4.xml.zip

AndreLamurias commented 6 years ago

I've created an application with just the component, uploaded a sample corpus, and got "System error getting execution status (Server responded: undefined)". Do you have any info about what could have failed? Anything I can fix?

gkirtzou commented 6 years ago

@AndreLamurias in order to test your component you need to create a workflow with the omtdImporter as a first step and then your component as a second step.

gkirtzou commented 6 years ago

I will also try to test your component tomorrow. I will let you know how that goes.

AndreLamurias commented 6 years ago

ok thanks! I'm trying to edit the workflow but it gets stuck while saving

gkirtzou commented 6 years ago

@AndreLamurias I created a private application using your component (omtdImporter + bolstm component) and when I ran it with the corpus that you have provided me, I got the following error

ERROR: unknown parameter "tmp"
if this was a real script you would see something useful here

./predict.sh
    -h --help
    --input=
    --output=
    --param:model=

where tmp is the input folder. You need to update your executor (i.e. predict.sh) to be compliant with the OMTD specifications https://openminted.github.io/releases/docker-spec/1.0.0/specification, see "Reading input data and writing output data for each TDM component".
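For illustration only, here is a minimal Python sketch of one way an entry point could accept the invocation shown in the log above. The --input/--output/--param: forms are taken from this thread, the linked docker spec is the authoritative reference, and predict.sh itself is a shell script, so treat this only as a hypothetical equivalent.

import sys

def parse_omtd_args(argv):
    # Accept both "--input DIR" and "--input=DIR" (same for --output),
    # plus optional "--param:name=value" parameters; ignore anything else,
    # such as the trailing "--" seen in the log.
    input_dir, output_dir, params = None, None, {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("--input"):
            input_dir = arg.split("=", 1)[1] if "=" in arg else argv[i + 1]
            i += 1 if "=" in arg else 2
        elif arg.startswith("--output"):
            output_dir = arg.split("=", 1)[1] if "=" in arg else argv[i + 1]
            i += 1 if "=" in arg else 2
        elif arg.startswith("--param:"):
            name, _, value = arg[len("--param:"):].partition("=")
            params[name] = value
            i += 1
        else:
            i += 1
    return input_dir, output_dir, params

if __name__ == "__main__":
    print(parse_omtd_args(sys.argv[1:]))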

Apart from that, the model parameter is optional according to your metadata record, so it shouldn't be required to be set when calling your component, is that right?

AndreLamurias commented 6 years ago

@gkirtzou I've updated the docker image which should have solved that issue.

Apart from that the model parameter is optional, given your metadata record. Thus it shouldn't required to be set when calling your component, is that right?

Yes, the model parameter is optional. BTW right now the default model is packaged with the docker image, inside the models/ directory. I see that it's possible to upload a model on the platform (add->models & grammars), should I rather upload the default model that way?

gkirtzou commented 6 years ago

@gkirtzou I've updated the docker image which should have solved that issue.

Did you update the same image "andrelamurias/bolstm:omtd" ?

BTW right now the default model is packaged with the docker image, inside the models/ directory. I see that it's possible to upload a model on the platform (add->models & grammars), should I rather upload the default model that way?

For the moment it is better to keep the model within your docker image.

AndreLamurias commented 6 years ago

yes, same tag: andrelamurias/bolstm:omtd

gkirtzou commented 6 years ago

I ran your component with the input you provided me, and I got the following error


 --input 'tmp' --output '/srv/galaxy/database/jobs_directory/001/1077/working/out/' --
input_dir is tmp
output_dir is /srv/galaxy/database/jobs_directory/001/1077/working/out/
params is full_model
DEBUG:gensim.models.doc2vec:Fast version of gensim.models.doc2vec is being used
INFO:summa.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English
Using TensorFlow backend.
Traceback (most recent call last):
  File "/src/train_rnn.py", line 25, in <module>
    from chebi_path import load_chebi
  File "/src/chebi_path.py", line 14, in <module>
    ssm.semantic_base("src/DiShIn/chebi.db")
  File "/src/DiShIn/ssm.py", line 39, in semantic_base
    connection = sqlite3.connect(sb_file)
sqlite3.OperationalError: unable to open database file

I think it has something to do with the path to the file. It is probably good practice to use full paths.
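As an illustration of the full-path suggestion (the file and directory names below are taken from the traceback; the actual layout inside the image is an assumption):

import os
import sqlite3

# Resolve the bundled database relative to this module's own location rather
# than the current working directory, so the script behaves the same no matter
# where the container's entry point is invoked from.
MODULE_DIR = os.path.dirname(os.path.abspath(__file__))
DB_FILE = os.path.join(MODULE_DIR, "DiShIn", "chebi.db")

def semantic_base(sb_file=DB_FILE):
    # An absolute path avoids the "unable to open database file" error caused
    # by a relative path like "src/DiShIn/chebi.db" resolving against the
    # wrong working directory.
    return sqlite3.connect(sb_file)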

AndreLamurias commented 6 years ago

Ok I've updated the docker image again, it should be possible to run /predict.sh from any directory

gkirtzou commented 6 years ago

I did a docker pull but nothing seems to have updated. Have you uploaded a new image, or have you updated andrelamurias/bolstm:omtd?

AndreLamurias commented 6 years ago

sorry, now it should be updated. it's still the same tag. can you try again?

gkirtzou commented 6 years ago

I tried it again and I got the following error

 --input 'tmp' --output '/srv/galaxy/database/jobs_directory/001/1081/working/out/' --
input_dir is tmp
output_dir is /srv/galaxy/database/jobs_directory/001/1081/working/out/
params is full_model
DEBUG:gensim.models.doc2vec:Fast version of gensim.models.doc2vec is being used
INFO:summa.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English
Using TensorFlow backend.
INFO:root:new chebi dictionary
/predict.sh: line 46:    15 Illegal instruction     (core dumped) python /src/train_rnn.py preprocessing_predict ddi full_model $INPUTDIR $OUTPUTDIR words wordnet common_ancestors concat_ancestors

and the docker container exited with status code 132. I am not sure whether this is a memory issue or whether it has to do with hardware requirements for keras and/or tensorflow.

Could you please tell me how much memory your software requires? Any help will be appreciated.

PS. Please complete also this spreadsheet https://docs.google.com/spreadsheets/d/1EzLcAvx5MaCUs2-4JEK5YCzXKSS9YdGxq3GeMrwlKYQ/edit#gid=0

gkirtzou commented 6 years ago

@AndreLamurias any news? Have you looked into it?

AndreLamurias commented 6 years ago

Based on the output I don't think it was a memory issue, so I've updated the image with newer versions of some python libraries (keras, tensorflow and spacy). This person had the same issue with the same versions of python and spacy: https://github.com/explosion/spaCy/issues/1589 The minimum memory is 7GB; I've filled in the spreadsheet with the requested information.

gkirtzou commented 6 years ago

@AndreLamurias Ok, I will pull the image again and test. We are also working on supporting your memory requirements, to verify that it isn't a memory issue. I will let you know.

AndreLamurias commented 6 years ago

ok thanks, hopefully upgrading the libraries solved that issue, otherwise, it might be difficult to debug it.

gkirtzou commented 6 years ago

@AndreLamurias Sorry for my late response, but we had some issues with our workflow engine cluster, thus we couldn't test your component. Today, I was able to run it and I got the following error


 --input 'tmp' --output '/srv/galaxy/database/jobs_directory/001/1145/working/out/' --
input_dir is tmp
output_dir is /srv/galaxy/database/jobs_directory/001/1145/working/out/
params is full_model
DEBUG:gensim.models.doc2vec:Fast version of gensim.models.doc2vec is being used
INFO:summa.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English
Using TensorFlow backend.
INFO:root:new chebi dictionary
INFO:root:processing entities: Abacavir.xml
INFO:root:processing entities: 21705423.xml
INFO:root:parsing Abacavir.xml
loading chebi from /data//chebi.obo...
synonyms to ids...
done. 109026 ids 231938 synonyms
best names of  ZIAGEN : ('Zierin', 67)
best synonyms of  ZIAGEN : [('Agene', 73), ('Shiragen', 71), ('Agn', 67), ('Zierin', 67), ('AGE', 67), ('magenta I', 67), ('argent', 67), ('biligen', 62), ('Acigena', 62), ('dicyanogen', 62)]
synonyms ZIAGEN ('Zierin', 67) [('Agene', 73), ('Shiragen', 71), ('Agn', 67), ('Zierin', 67), ('AGE', 67), ('magenta I', 67), ('argent', 67), ('biligen', 62), ('Acigena', 62), ('dicyanogen', 62)]
best names of  3-[(2-methyl-1,3-thiazol-4-yl) ethynyl] pyridine : ('3-ethynyl-5-(1-methyl-2-pyrrolidinyl)pyridine', 80)
Traceback (most recent call last):
Saving chebi dictionary...!
  File "/src/train_rnn.py", line 734, in <module>
    main()
  File "/src/train_rnn.py", line 469, in main
    X_train_wordnet = get_ddi_data([sys.argv[4]], name_to_id, synonym_to_id, id_to_name)
  File "/src/train_rnn.py", line 281, in get_ddi_data
    neg_gv, pos_gv = get_ddi_sdp_instances(dir, name_to_id, synonym_to_id, id_to_name)
  File "/src/parse_ddi.py", line 137, in get_ddi_sdp_instances
    parsed_sentences, wordnet_sentences = parse_ddi_sentences_spacy(base_dir, entities)
  File "/src/parse_ddi.py", line 112, in parse_ddi_sentences_spacy
    tree = ET.parse(base_dir + f)
  File "/usr/local/lib/python3.5/xml/etree/ElementTree.py", line 1195, in parse
    tree.parse(source, parser)
  File "/usr/local/lib/python3.5/xml/etree/ElementTree.py", line 585, in parse
    source = open(source, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'tmpAbacavir.xml'

tmp is the input folder. I think you are missing the intermediate '/' when building the path to the input files. Could you correct that? Thanks
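A small sketch of the suggested fix, assuming base_dir and the file names arrive as shown in the log (os.path.join inserts the missing separator and does not double it when the directory already ends with '/'):

import os
import xml.etree.ElementTree as ET

def parse_input_files(base_dir, file_names):
    trees = []
    for f in file_names:
        # "tmp" + "Abacavir.xml" becomes "tmp/Abacavir.xml" instead of "tmpAbacavir.xml"
        trees.append(ET.parse(os.path.join(base_dir, f)))
    return trees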

AndreLamurias commented 6 years ago

@gkirtzou I've updated the image, let me know if there are more errors

gkirtzou commented 6 years ago

@AndreLamurias I reran the new image and I got the following error

 --input 'tmp' --output '/srv/galaxy/database/jobs_directory/001/1162/working/out/' --
input_dir is tmp
output_dir is /srv/galaxy/database/jobs_directory/001/1162/working/out/
params is full_model
DEBUG:gensim.models.doc2vec:Fast version of gensim.models.doc2vec is being used
INFO:summa.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English
Using TensorFlow backend.
INFO:root:new chebi dictionary
INFO:root:processing entities: Abacavir.xml
INFO:root:processing entities: 21705423.xml
INFO:root:parsing Abacavir.xml
INFO:root:parsing 21705423.xml

tagger_light::bitag(
    modelname_pos:  /sst-light-0.4//MODELS/WSJPOSc_base_20
    tagsetname_pos: /sst-light-0.4//DATA/WSJPOSc.TAGSET
    modelname:  /sst-light-0.4//MODELS/SEM07_base_12
    targetname: /temp//sentences_0.txt
    tagset: /sst-light-0.4//DATA/WNSS_07.TAGSET
    lowercase:  0
    output: /temp//sentences_0.txt.tags )
load_words_notag(/temp//sentences_0.txt)    |D| = 12    |ID| = 12
tagger_light.init(/sst-light-0.4//MODELS/WSJPOSc_base_20,/sst-light-0.4//DATA/WSJPOSc.TAGSET)   |Y| = 45/45
LoadModel(/sst-light-0.4//MODELS/WSJPOSc_base_20)....   OK
tagger_light.init(/sst-light-0.4//MODELS/SEM07_base_12,/sst-light-0.4//DATA/WNSS_07.TAGSET) |Y| = 93/93
LoadModel(/sst-light-0.4//MODELS/SEM07_base_12).... OKtagging:
.
INFO:root:generating instances: Abacavir.xml
INFO:root:skipped gv combination ['lamivudine-12', 'combination-17', 'of-18', 'lamivudine-19', 'zidovudine-21']:
INFO:root:skipped gv pharmacokinetics ['lamivudine-5', 'pharmacokinetics-8', 'zidovudine-7']:
INFO:root:skipped gv pharmacokinetics ['zidovudine-7', 'pharmacokinetics-8', 'lamivudine-5', 'changes-3', 'observed-10', 'following-11', 'administration-13', 'of-14', 'abacavir-15']:
INFO:root:generating instances: 21705423.xml
WARNING:root:('head token conflict:', 'DDI-MedLine.d161.s4.e2', ([366, 414], '3-[(2-methyl-1,3-thiazol-4-yl) ethynyl] pyridine', 'CHEBI:125396'))
WARNING:root:('head token conflict:', 'DDI-MedLine.d161.s5.e3', ([200, 244], '1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine', 'CHEBI:17963'))
WARNING:root:('head token conflict:', 'DDI-MedLine.d161.s6.e0', ([394, 442], '3-[(2-methyl-1,3-thiazol-4-yl) ethynyl] pyridine', 'CHEBI:125396'))
WARNING:root:('head token conflict:', 'DDI-MedLine.d161.s8.e2', ([283, 327], '1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine', 'CHEBI:17963'))
INFO:gensim.models.keyedvectors:loading projection weights from /data//PubMed-w2v.bin
INFO:gensim.models.keyedvectors:loaded (2351706, 200) matrix from /data//PubMed-w2v.bin
WARNING:tensorflow:From /usr/local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1247: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /usr/local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1213: calling reduce_max (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
loading chebi from /data//chebi.obo...
synonyms to ids...
done. 109026 ids 231938 synonyms
['words', 'wordnet', 'common_ancestors', 'concat_ancestors']
Loaded model /models//full_model from disk
Saving chebi dictionary...!
Traceback (most recent call last):
  File "/src/train_rnn.py", line 734, in <module>
    main()
  File "/src/train_rnn.py", line 483, in main
    X_train_ancestors, id_to_index)
  File "/src/train_rnn.py", line 447, in predict
    with open("{}/{}_{}results.txt".format(outputpath, modelname, corpusname.replace("/", ".")), 'w') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/srv/galaxy/database/jobs_directory/001/1162/working/out//full_model_tmpresults.txt'

I see that there are two '/' before the output filename. I think you need to check whether the path ends with '/' and handle it appropriately.

AndreLamurias commented 6 years ago

I don't think the double slashes would be an issue, but I can't think of any other explanation so I've pushed another image with the change that you suggested.

gkirtzou commented 6 years ago

You were right, it wasn't the double '/' in the filename that was the problem. I ran the new image and we got the same error ....

 --input 'tmp' --output '/srv/galaxy/database/jobs_directory/001/1185/working/out/' --
input_dir is tmp
output_dir is /srv/galaxy/database/jobs_directory/001/1185/working/out/
params is full_model
DEBUG:gensim.models.doc2vec:Fast version of gensim.models.doc2vec is being used
INFO:summa.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English
Using TensorFlow backend.
INFO:root:new chebi dictionary
INFO:root:processing entities: Abacavir.xml
INFO:root:processing entities: 21705423.xml
INFO:root:parsing Abacavir.xml
INFO:root:parsing 21705423.xml

tagger_light::bitag(
    modelname_pos:  /sst-light-0.4//MODELS/WSJPOSc_base_20
    tagsetname_pos: /sst-light-0.4//DATA/WSJPOSc.TAGSET
    modelname:  /sst-light-0.4//MODELS/SEM07_base_12
    targetname: /temp//sentences_0.txt
    tagset: /sst-light-0.4//DATA/WNSS_07.TAGSET
    lowercase:  0
    output: /temp//sentences_0.txt.tags )
load_words_notag(/temp//sentences_0.txt)    |D| = 12    |ID| = 12
tagger_light.init(/sst-light-0.4//MODELS/WSJPOSc_base_20,/sst-light-0.4//DATA/WSJPOSc.TAGSET)   |Y| = 45/45
LoadModel(/sst-light-0.4//MODELS/WSJPOSc_base_20)....   OK
tagger_light.init(/sst-light-0.4//MODELS/SEM07_base_12,/sst-light-0.4//DATA/WNSS_07.TAGSET) |Y| = 93/93
LoadModel(/sst-light-0.4//MODELS/SEM07_base_12).... OKtagging:
.
INFO:root:generating instances: Abacavir.xml
INFO:root:skipped gv combination ['lamivudine-12', 'combination-17', 'of-18', 'lamivudine-19', 'zidovudine-21']:
INFO:root:skipped gv pharmacokinetics ['zidovudine-7', 'pharmacokinetics-8', 'lamivudine-5', 'changes-3', 'observed-10', 'following-11', 'administration-13', 'of-14', 'abacavir-15']:
INFO:root:skipped gv pharmacokinetics ['lamivudine-5', 'pharmacokinetics-8', 'zidovudine-7']:
INFO:root:generating instances: 21705423.xml
WARNING:root:('head token conflict:', 'DDI-MedLine.d161.s4.e2', ([366, 414], '3-[(2-methyl-1,3-thiazol-4-yl) ethynyl] pyridine', 'CHEBI:125396'))
WARNING:root:('head token conflict:', 'DDI-MedLine.d161.s5.e2', ([245, 293], '3-[(2-methyl-1,3-thiazol-4-yl) ethynyl] pyridine', 'CHEBI:125396'))
WARNING:root:('head token conflict:', 'DDI-MedLine.d161.s6.e0', ([394, 442], '3-[(2-methyl-1,3-thiazol-4-yl) ethynyl] pyridine', 'CHEBI:125396'))
WARNING:root:('head token conflict:', 'DDI-MedLine.d161.s8.e1', ([328, 376], '3-[(2-methyl-1,3-thiazol-4-yl) ethynyl] pyridine', 'CHEBI:125396'))
INFO:gensim.models.keyedvectors:loading projection weights from /data//PubMed-w2v.bin
INFO:gensim.models.keyedvectors:loaded (2351706, 200) matrix from /data//PubMed-w2v.bin
WARNING:tensorflow:From /usr/local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1247: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /usr/local/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1213: calling reduce_max (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
loading chebi from /data//chebi.obo...
synonyms to ids...
done. 109026 ids 231938 synonyms
['words', 'wordnet', 'common_ancestors', 'concat_ancestors']
Loaded model /models//full_model from disk
Traceback (most recent call last):
  File "/src/train_rnn.py", line 736, in <module>
    main()
  File "/src/train_rnn.py", line 485, in main
    X_train_ancestors, id_to_index)
  File "/src/train_rnn.py", line 449, in predict
    with open("{}/{}_{}results.txt".format(outputpath, modelname, corpusname.replace("/", ".")), 'w') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/srv/galaxy/database/jobs_directory/001/1185/working/out/full_model_tmpresults.txt'
Saving chebi dictionary...!

full_model_tmpresults.txt is a file you are generating, right? Then how can it not be found?

AndreLamurias commented 6 years ago

Yes, it should create the file, or overwrite it if it already exists. Could it be an issue with permissions? Is it possible to try a different output path?

gkirtzou commented 6 years ago

I don't think that a different path will work; it has to do with how galaxy handles the docker containers. A quick question: do you create the output folder, or do you assume it exists? Ideally you should create the output dir if it does not exist. Could that be the problem?

AndreLamurias commented 6 years ago

The script assumed that the provided output path existed, but I can change it so that it creates the directory if it doesn't exist. Just to confirm, in your test, the script should check if "/srv/galaxy/database/jobs_directory/001/1185/working/out" exists, and if not it should create all the missing directories?

AndreLamurias commented 6 years ago

I've updated the image so that it creates the output path if it doesn't exist
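Roughly, the change amounts to something like the following sketch (illustrative only; the file name and the exact code in the image may differ):

import os

def ensure_output_dir(output_dir):
    # Create the output directory, including any missing parents;
    # exist_ok avoids an error when it is already there.
    os.makedirs(output_dir, exist_ok=True)
    return output_dir

out_dir = ensure_output_dir("/output")
with open(os.path.join(out_dir, "full_model_results.txt"), "w") as results:
    results.write("")  # the real script writes the TSV predictions here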

gkirtzou commented 6 years ago

I was able to rerun your component successfully!!! I am attaching the generated file. Could you verify that it is what you are expecting? As input I used the corpus you have provided me.

output.tsv.zip

screenshot at 2018-05-03 13-22-51

AndreLamurias commented 6 years ago

Great! Yes, that is the expected result. Thanks for the help. Do you think it would now be a good idea to implement XMI as an input format, so that the component can be added to workflows that use that format? If you could send me an example file of a document annotated with chemical entities, I think it wouldn't take much time to implement that.

gkirtzou commented 6 years ago

It is recommended to use XMI as input, as this would make your component interoperable with other components and easier to use with the corpora generated within the OMTD platform (usually PDF files). In this case, an application would need to have at least the following components, in this order:

  1. omtdImporter, the component that gets the data from the registry to the workflow engine
  2. a converter, i.e. tika, that converts files from a variety of mimetypes to XMI
  3. your component
AndreLamurias commented 6 years ago

@gkirtzou how can I obtain some XMI files with chemical named entity annotations that I could use to test?

gkirtzou commented 6 years ago

@AndreLamurias I am not aware of any component that could generate chemical named entity annotations. In the platform there is an application that does that (https://test.openminted.eu/landingPage/application/756502de-a422-4ba4-9c95-04a7d5e9e28c) and it might be what you are asking for. The problem is that, the way the platform is currently implemented, you cannot process an annotated corpus generated by an application.

gkirtzou commented 6 years ago

Dear @AndreLamurias, you can now proceed to upload your component at https://services.openminted.eu/home

Please, when you upload your component, create the appropriate workflow so that someone can run your component using the workflow editor. For more info see https://openminted.github.io/releases/workflow-editor/ Please also let me know when you have uploaded your component to the production site. If you encounter any problems, please let us know. Thanks

AndreLamurias commented 6 years ago

@gkirtzou thanks, I will upload the component soon. I assume that I have to make it public. Then, if I ever need to fix a bug or update a model, for example, should I submit a new version of the tool, or can I push the update to Docker Hub and OMTD will pull the new image?

gkirtzou commented 6 years ago

@AndreLamurias yes, we want it to be public, and also don't forget to create an application so that non-expert end users can use your software without any tuning. For reproducibility reasons, if you update the software (either due to a bug or due to an update of the model), it would be good to upload a new docker image with another version and re-register it to OMTD. You could add in the metadata record the relation "replace" pointing to the previous version of your component, in order to indicate that this is a new version. You could also add this type of info in the description, for human consumption.

gkirtzou commented 6 years ago

Note that when you create an application, you will be asked to fill in a metadata record. Some tips for filling it in: make sure the metadata makes your application discoverable by users, but also allows users to cite you and your resource.

If you encounter any problems, please let us know. Thanks

AndreLamurias commented 6 years ago

I have uploaded the component: https://services.openminted.eu/landingPage/component/32dd1eb5-ae53-4e97-8cb7-b7bb5849c5cb

Here is the workflow using the component: https://services.openminted.eu/landingPage/application/4e806362-fb26-403f-ae9e-9716fa5f2cef I have not made it public yet in case there's some issue with the metadata: 4e806362-fb26-403f-ae9e-9716fa5f2cef.zip

Should I also upload the sample corpus?

gkirtzou commented 6 years ago

@AndreLamurias A few comments

For the component I notice that the image you registered is andrelamurias/bolstm:omtd-v1, correct? But in versionInfo you declared the version as 1.0.0. This is a bit inconsistent. Which is the correct version, omtd-v1 or 1.0.0? I am sorry that I didn't mention it earlier; I just noticed it. Could you change it? I can ask for your component to be deleted, if you agree.

Also, a new suggestion: you could add the deliverables of the Hackathon to your github repo and add them as documentation to your component as well.

For the application the metadata seems perfect. You could add yourself as the resourceCreator as well, for provenance and citation reasons. Note that if we delete the component above, you will need to recreate the workflow (to be sure that everything works).

Sample corpus In your case, a sample corpus is necessary, as your component cannot work out of the box with corpora generated in the OMTD platform. Please upload your corpus, following the OMTD instructions, and send me the metadata so that I can check them before you make it public.

AndreLamurias commented 6 years ago

For the component

Ok I have uploaded a new component using andrelamurias/bolstm:omtd-v1.0.0 to keep things consistent: https://services.openminted.eu/landingPage/component/f2ef9b8f-d4bf-4cdd-8380-c7e63f1a0dc2

Also, there is a new suggestion that you could add the deliverables of the Hackathon to your github repo and add this as well as documentation to your component

Do you mean the deliverables of the open tender? I can upload those documents to the repo if that's what you mean.

For the application

I'll wait for you to delete the previous component before updating the workflow, so that I don't use the old one by mistake.

Sample corpus

Here is the link: https://services.openminted.eu/landingPage/corpus/472a3e95-31c3-43b6-9ade-48edd75db377 And metadata: 472a3e95-31c3-43b6-9ade-48edd75db377.zip

thanks!

gkirtzou commented 6 years ago

Thanks

AndreLamurias commented 6 years ago

@gkirtzou

One small suggestion for the corpus. Can you just mention in the description that the corpus is already annotated at the term level, is that correct?

The only place to add this information is in the description, right? I was looking for a specific field in the form but couldn't find it.

I've added the deliverables to the github repo and I will add it as documentation when I upload a new version of the component.

Right now I cannot add BOLSTM to a workflow: it doesn't appear in the workflow editor, even when I start a new application. However, it does appear under "My components" and it is public.