opencb / cellbase

High-Performance NoSQL database and RESTful web services to access to most relevant biological data
Apache License 2.0

Loading data model #281

Closed apaytuvi closed 8 years ago

apaytuvi commented 8 years ago

When trying to load my data into MongoDB, after running the following commands:

../build/bin/cellbase.sh download -d genome,gene -s slycopersicum -o slycopersicum -C configuration.json
../build/bin/cellbase.sh build -i /Synology/server_variants/cellbase/data/slycopersicum/solanum_lycopersicum_gca_000188115.2/ --common /Synology/server_variants/cellbase/data/slycopersicum/common/ -d genome,gene -o slycopersicum_build/ -C configuration.json
../build/bin/cellbase.sh load -d genome,gene --database cellbase_slycopersicum_sl2.50 -i slycopersicum_build/ -L debug -Dmongodb-index-folder=/Synology/server_variants/cellbase/cellbase-app/app/mongodb-scripts/ -C configuration.json

it throws this error:

[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 1000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 2000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 3000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 4000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 5000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 6000 records read from slycopersicum_build/genome_sequence.json.gz
[pool-1-thread-2] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:3, serverValue:273}] to localhost:27017
[pool-1-thread-1] INFO org.mongodb.driver.connection - Opened connection [connectionId{localValue:2, serverValue:272}] to localhost:27017
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 7000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 8000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 9000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 10000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 11000 records read from slycopersicum_build/genome_sequence.json.gz
[main] INFO org.opencb.cellbase.core.loader.LoadRunner - 12000 records read from slycopersicum_build/genome_sequence.json.gz
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at com.mongodb.connection.WriteCommandProtocol.receiveMessage(WriteCommandProtocol.java:238)
    at com.mongodb.connection.WriteCommandProtocol.execute(WriteCommandProtocol.java:104)
    at com.mongodb.connection.InsertCommandProtocol.execute(InsertCommandProtocol.java:67)
    at com.mongodb.connection.InsertCommandProtocol.execute(InsertCommandProtocol.java:37)
    at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
    at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:286)
    at com.mongodb.connection.DefaultServerConnection.insertCommand(DefaultServerConnection.java:115)
    at com.mongodb.operation.MixedBulkWriteOperation$Run$2.executeWriteCommandProtocol(MixedBulkWriteOperation.java:455)
    at com.mongodb.operation.MixedBulkWriteOperation$Run$RunExecutor.execute(MixedBulkWriteOperation.java:646)
    at com.mongodb.operation.MixedBulkWriteOperation$Run.execute(MixedBulkWriteOperation.java:401)
    at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:179)
    at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:168)
    at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:230)
    at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:221)
    at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
    at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
    at com.mongodb.Mongo.execute(Mongo.java:781)
    at com.mongodb.Mongo$2.execute(Mongo.java:764)
    at com.mongodb.MongoCollectionImpl.bulkWrite(MongoCollectionImpl.java:291)
    at org.opencb.commons.datastore.mongodb.MongoDBNativeQuery.insert(MongoDBNativeQuery.java:154)
    at org.opencb.commons.datastore.mongodb.MongoDBCollection.insert(MongoDBCollection.java:378)
    at org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader.load(MongoDBCellBaseLoader.java:514)
    at org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader.prepareBatchAndLoad(MongoDBCellBaseLoader.java:359)
    at org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader.call(MongoDBCellBaseLoader.java:311)
    at org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader.call(MongoDBCellBaseLoader.java:50)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[pool-1-thread-2] ERROR org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader - Error Loading batch: Index: 0, Size: 0

apaytuvi commented 8 years ago

The problem was that the database name cannot contain a dot. Replacing --database cellbase_slycopersicum_sl2.50 with --database cellbase_slycopersicum works. However, I can't access the genome sequence:

/cellbase/webservices/rest/v3/slycopersicum/genome/sequence

{"apiVersion":"v2","warning":"","error":"javax.ws.rs.NotFoundException: HTTP 404 Not Found","queryOptions":{},"response":[{"id":"","time":0,"dbTime":-1,"numResults":-1,"numTotalResults":-1,"warningMsg":"Future errors will ONLY be shown in the QueryResponse body","errorMsg":"DEPRECATED: javax.ws.rs.NotFoundException: HTTP 404 Not Found","resultType":"","result":[]}]}

I also tried different database names such as cellbase_slycopersicum_sl2-50_v3.

javild commented 8 years ago

Hi @apaytuvi. The database name must have the form: cellbase_<short_species_name>_<assembly>_<version>

where:

<short_species_name> in your case would be, as you indicated, slycopersicum.

<assembly> must be the assembly without hyphens, dots, underscores or special symbols. In your case it could be sl250.

<version> is the CellBase data version. If I remember correctly, you built the release/v4.0.0 code, which means that you must use v4 here. I strongly recommend that you use the release/v4.0.0 code, since it has many fixes, new features and much improved documentation compared with the last 3.2 stable release. The release/v4.0.0 branch will become the next v4.0 stable release in a few days.

Moreover, the database name must follow the form I just described, since the server code automatically generates the name when connecting to the database by joining those four strings (cellbase, <short_species_name>, <assembly>, <version>). We're aware that the loading CLI lets the user provide a "custom" database name even though only one database name is actually valid, and that this name could be generated automatically by the loader as well. This will be improved in the future. It is also not properly explained in the documentation.
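
The naming rule described in this comment can be sketched in a few lines of Python. This is only an illustration, not CellBase code: the function name cellbase_db_name is hypothetical, and lowercasing the assembly is an assumption based on the sl250 example.

```python
import re


def cellbase_db_name(species, assembly, version):
    """Build cellbase_<short_species_name>_<assembly>_<version> (hypothetical helper)."""
    # Keep only letters and digits: drops dots, hyphens, underscores and other symbols.
    clean_assembly = re.sub(r"[^A-Za-z0-9]", "", assembly)
    # Assumption: the assembly part is lowercased, as in the "sl250" example.
    return "_".join(["cellbase", species, clean_assembly.lower(), version])


print(cellbase_db_name("slycopersicum", "SL2.50", "v4"))
# prints cellbase_slycopersicum_sl250_v4
```

The resulting name contains no dot, which avoids the load failure reported earlier in this thread.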

The server gets the assembly identifier from the CellBase configuration.json file: https://github.com/opencb/cellbase/blob/release/v4.0.0/cellbase-core/src/main/resources/configuration.json This means that if you want to use an assembly different from the one indicated in configuration.json, you must edit the file. For your species of interest, the current assembly identifier is GCA_000188115.2:

https://github.com/opencb/cellbase/blob/release/v4.0.0/cellbase-core/src/main/resources/configuration.json#L862

Finally, if you are using release/v4.0.0 you must use the "v4" version in the URL (this is why you are getting a 404 Not Found error). That is: /cellbase/webservices/rest/v4/slycopersicum/genome/sequence

You should be able to see the API specification at /cellbase/webservices
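
As a small illustration of how the version and species fit into the URL, here is a Python sketch that assembles the path from its parts. The helper name cellbase_rest_path is hypothetical and is not part of CellBase. The base path is the one used in this thread.

```python
def cellbase_rest_path(version, species, *resource):
    """Join base path, API version, species and resource parts (hypothetical helper)."""
    base = "/cellbase/webservices/rest"
    return "/".join([base, version, species, *resource])


print(cellbase_rest_path("v4", "slycopersicum", "genome", "sequence"))
# prints /cellbase/webservices/rest/v4/slycopersicum/genome/sequence
```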

Summarizing:

apaytuvi commented 8 years ago

Thanks @javild

I do use the release/v4.0.0 and my database is cellbase_slycopersicum_gca0001881152_v4. However, I can't get anything from the REST API. For example, by going to: /cellbase/webservices/rest/v4/slycopersicum/genomic/chromosome/list, I get:

{"apiVersion":"v4","warning":"","error":"","queryOptions":{"exclude":["_id","_chunkIds"],"metadata":true,"include":null,"limit":1000,"skip":-1,"count":false},"response":[{"time":0,"dbTime":2,"numResults":0,"numTotalResults":0,"resultType":"","result":[]}]}

I do not know if anything is wrong, since when I go to /cellbase/webservices/, at the bottom of the page, it shows:

[ base url: /cellbase/webservices/rest , api version: 3.2.0 ]

API version 3? I am using release/v4.0.0. I can also see an error button and, when I click it, it shows me:

{"schemaValidationMessages":[{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/9"}},{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/10"}},{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/11"}},{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/12"}},{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/13"}},{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/14"}},{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/15"}},{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/16"}},{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/17"}},{"level":"error","domain":"validation","keyword":"oneOf","message":"instance failed to match exactly one schema (matched 0 out of 2)","schema":{"loadingURI":"http://swagger.io/v2/schema.json#","pointer":"/definitions/parametersList/items"},"instance":{"pointer":"/paths/~1{version}~1{species}~1feature~1clinical~1search/get/parameters/18"}}]}

javild commented 8 years ago

Don't worry about the "api version: 3.2.0" message at the bottom. It's hardcoded -we'll fix it- and as long as you have deployed the cellbase.war corresponding to the release/v4.0.0 branch it's perfectly fine.

It's connecting to the mongo server, otherwise it'd be raising an error. Therefore, it seems that either the database name is not correct or the genome_info collection is empty. Please connect to the MongoDB database and run:

> db.genome_info.find()
{ "_id" : ObjectId("576d561922cfd5386301eba0"), "species" : "Homo sapiens", "supercontigs" : [ ], "chromosomes" : [ { "cytobands" : [ { "stain" : "acen", "name" : "p11", "end" : 26200000, "start" : 24200001 }, { "stain" : "gvar", "name" : "p12", "end" : 24200000, "start" : 19900001 }, { "stain" : "gneg", "name" : "p13.11", "end" : 19900000, "start" : 16100001 }, { "stain" : "gpos25", "name" : "p13.12", "end" : 16100000, "start" : 13800001 }, { "stain" : "gneg", "name" : "p13.13", "end" : 13800000, "start" : 12600001 }, { "stain" : "gpos25", "name" : "p13.2", "end" : 12600000, "start" : 6900001 }, { "stain" : "gneg", "name" : "p13.3", "end" : 6900000, "start" : 1 }, { "stain" : "acen", "name" : "q11", "end" : 28100000, "start" : 26200001 }, { "stain" : "gvar", "name" : "q12", "end" : 31900000, "start" : 28100001 }, { "stain" : "gneg", "name" : "q13.11", "end" : 35100000, "start" : 31900001 }, { "stain" : "gpos25", "name" : "q13.12", "end" : 37800000, "start" : 35100001 }, { "stain" : "gneg", "name" : "q13.13", "end" : 38200000, "start" : 37800001 }, { "stain" : "gpos25", "name" : "q13.2", "end" : 42900000, "start" : 38200001 }, { "stain" : "gneg", "name" : "q13.31", "end" : 44700000, "start" : 42900001 }, { "stain" : "gpos25", "name" : "q13.32", "end" : 47500000, "start" : 44700001 }, { "stain" : "gneg", "name" : "q13.33", "end" : 50900000, "start" : 47500001 }, { "stain" : "gpos25", "name" : "q13.41", "end" : 53100000, "start" : 50900001 }, { "stain" : "gneg", "name" : "q13.42", "end" : 55800000, "start" : 53100001 }, { "stain" : "gpos25", "name" : "q13.43", "end" : 58617616, "start" : 55800001 } ], "name" : "19", "isCircular" : 0, "size" : 58617616, "end" : 58617616, "start" : 1 }, { "cytobands" : [ { "stain" : "acen", "name" : "p11.1", "end" : 28100000, "start" : 25700001 }, { "stain" : "gneg", "name" : "p11.21", "end" : 25700000, "start" : 22300001 }, { "stain" : "gpos......
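
The reasoning in this comment (no error in the response means the server reached MongoDB, so an empty result points at the database name or an empty collection) can be sketched as a small check on the REST response body. This is an illustration only: the function name and the three labels are hypothetical, and only the "error", "response" and "result" fields are taken from the responses quoted in this thread.

```python
import json


def diagnose_response(response_text):
    """Classify a CellBase REST response body as 'error', 'empty' or 'ok' (hypothetical labels)."""
    body = json.loads(response_text)
    # A non-empty top-level "error" field, e.g. the 404 shown earlier in the thread.
    if body.get("error"):
        return "error"
    # Collect every item from the "result" list of each entry under "response".
    results = [
        item
        for entry in body.get("response", [])
        for item in entry.get("result", [])
    ]
    # No error but no results: check the database name and the collection contents.
    return "ok" if results else "empty"
```

An "empty" label for the chromosome list request would match the situation described here, where the query succeeds but returns nothing.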
javild commented 8 years ago

There has been no movement on this issue for quite a number of days, so I'm closing it. It may be reopened if necessary.

apaytuvi commented 8 years ago

The problem was that I did not have the genome_info collection. Thanks @javild !