Open grahamc opened 7 years ago
Ok, so I crossed-out the part about not updating the database.
I've seen this happen at least once:
+ curl --basic -u blake:admin:first -X POST -d pool=DATABASE -d address=10.50.2.10 -d gateway=10.50.2.1 -d netmask=255.255.255.0 http://localhost:9000/api/asset/mysql-container-10/address
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
[158B blob data]
{"status":"error","data":{"message":"Unable to update address: ...
(but didn't catch the http response code...)
but
@grahamc have you only been able to repro using the built-in H2 database, or with mysql as well?
I have also reproduced with mysql.
I'll need to allocate time to dig into the root cause here. Thanks for the report!
Sounds good. Let me know if I can help, likely soon I'll be able to open source the code used to cause the issue in the first place.
need to add transactional reads when verifying that a meta attribute exists
https://github.com/tumblr/collins/blob/master/conf/evolutions/collins/1.sql#L32
But this shouldn't be an issue in practice. Collins does create meta fields when they don't exist during asset creation. In practice this happens rarely. If you create an initial asset so that meta fields are created and then run any number in intakes at the same time, it won't be an issue.
I'd go to the extent of suggesting that this be considered a known issue, or implemented via a flag, blocking asset intake on account of validating asset meta existence has implication beyond the cases when asset meta needs to be created.
@maddalab does it need to be any asset, or a server asset in particular?
Hmm... @maddalab I gave it a try, loading in a server nearly identical to the rest of my assets, then once that finished doing parallel intake of many servers. However, I still found the problem.
Here is the script I'm running prior to intake:
URL="http://localhost:9000/api/asset"
TAG="bogus-server"
LLDP_FILE="single.lldp"
LSHW_FILE="virident.lshw"
curl --basic -X PUT -H "Accept: text/plain" \
-u blake:admin:first "$URL/$TAG"
curl --basic -H "Accept: text/plain" -u blake:admin:first \
--data-urlencode "lldp@$LLDP_FILE" \
--data-urlencode "lshw@$LSHW_FILE" \
--data-urlencode 'CHASSIS_TAG=bogus-server' \
"$URL/$TAG"
and then I run this for several nodes simultaneously (this is for #12, but I run this for 10, 11, and 12 in this case):
URL="http://localhost:9000/api/asset"
TAG="mysql-container-12"
LLDP_FILE="single.lldp"
LSHW_FILE="virident.lshw"
curl --basic -X PUT -H "Accept: text/plain" \
-u blake:admin:first "$URL/$TAG"
curl --basic -H "Accept: text/plain" -u blake:admin:first \
--data-urlencode "lldp@$LLDP_FILE" \
--data-urlencode "lshw@$LSHW_FILE" \
--data-urlencode 'CHASSIS_TAG=Testing this' \
"$URL/$TAG"
collins modify -C "collins-cli.yaml" -t "mysql-container-12" -S maintenance:maint_noop -r "Provisioning"
collins provision -C "collins-cli.yaml" -t "mysql-container-12" -n databasenode -r DATABASE -b blake
collins modify -C "collins-cli.yaml" -t "mysql-container-12" \
-S Allocated:SPARE \
-r "Provisioning" \
-a "HOSTNAME:mysql-container-12"
curl --basic -u blake:admin:first -X POST \
-d pool=DATABASE \
-d address=10.50.2.12 \
-d gateway=10.50.2.1 \
-d netmask=255.255.255.0 \
http://localhost:9000/api/asset/mysql-container-12/address
should this have fixed it? Here is the log, continuing to demonstrate the problem:
In a test environment, I have setup several test assets that I "intake" in parallel. This can produce stack traces and errors in provisioning:
Trace 1:
``` 2017-02-01 15:58:52,793 - [WARN] - o.a.s.c.SolrCore - o.a.s.c.SolrCore:getSearcher:1761 - [collins] PERFORMANCE WARNING: Overlapping onDeckSearchers=2 2017-02-01 15:58:54,146 - [WARN] - application - c.m.AssetLifecycle:collins$models$AssetLifecycle$$handleException:303 - Error saving attributes for asset java.lang.RuntimeException: Exception while executing statement : Unique index or primary key violation: "CONSTRAINT_9D9B1_INDEX_F ON PUBLIC.ASSET_META(NAME) VALUES (CAST('SUFFIX' AS VARCHAR_IGNORECASE), 48)"; SQL statement: insert into asset_meta (name, priority, value_type, description, label) values (?,?,?,?,?) [23505-188] errorCode: 23505, sqlState: 23505 insert into asset_meta (name, priority, value_type, description, label) values (?,?,?,?,?) jdbcParams:[SUFFIX,-1,1,SUFFIX,Suffix] at org.squeryl.internals.DatabaseAdapter$class._exec(DatabaseAdapter.scala:297) ~[org.squeryl.squeryl_2.11-0.9.5-7.jar:0.9.5-7] at org.squeryl.internals.DatabaseAdapter$class.exec(DatabaseAdapter.scala:349) ~[org.squeryl.squeryl_2.11-0.9.5-7.jar:0.9.5-7] at org.squeryl.adapters.H2Adapter.exec(H2Adapter.scala:22) ~[org.squeryl.squeryl_2.11-0.9.5-7.jar:0.9.5-7] at org.squeryl.internals.DatabaseAdapter$class.executeUpdateForInsert(DatabaseAdapter.scala:383) ~[org.squeryl.squeryl_2.11-0.9.5-7.jar:0.9.5-7] at org.squeryl.adapters.H2Adapter.executeUpdateForInsert(H2Adapter.scala:22) ~[org.squeryl.squeryl_2.11-0.9.5-7.jar:0.9.5-7] Caused by: org.h2.jdbc.JdbcSQLException: Unique index or primary key violation: "CONSTRAINT_9D9B1_INDEX_F ON PUBLIC.ASSET_META(NAME) VALUES (CAST('SUFFIX' AS VARCHAR_IGNORECASE), 48)"; SQL statement: insert into asset_meta (name, priority, value_type, description, label) values (?,?,?,?,?) [23505-188] at org.h2.message.DbException.getJdbcSQLException(DbException.java:345) ~[com.h2database.h2-1.4.188.jar:1.4.188] at org.h2.message.DbException.get(DbException.java:179) ~[com.h2database.h2-1.4.188.jar:1.4.188] at org.h2.message.DbException.get(DbException.java:155) ~[com.h2database.h2-1.4.188.jar:1.4.188] at org.h2.index.BaseIndex.getDuplicateKeyException(BaseIndex.java:102) ~[com.h2database.h2-1.4.188.jar:1.4.188] at org.h2.mvstore.db.MVSecondaryIndex.checkUnique(MVSecondaryIndex.java:231) ~[com.h2database.h2-1.4.188.jar:1.4.188] 2017-02-01 15:58:54,795 - [WARN] - o.a.s.c.SolrCore - o.a.s.c.SolrCore:getSearcher:1761 - [collins] PERFORMANCE WARNING: Overlapping onDeckSearchers=2 ```Unique index or primary key violation
**CLICK TO EXPAND**Trace2:
``` 2017-02-01 16:17:35,551 - [WARN] - application - c.m.AssetLifecycle:collins$models$AssetLifecycle$$handleException:303 - Error saving attributes for asset java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:347) ~[org.scala-lang.scala-library-2.11.7.jar:na] at scala.None$.get(Option.scala:345) ~[org.scala-lang.scala-library-2.11.7.jar:na] at collins.models.AssetMetaValue.meta$lzycompute(AssetMetaValue.scala:32) ~[collins.collins-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] at collins.models.AssetMetaValue.meta(AssetMetaValue.scala:32) ~[collins.collins-2.0-SNAPSHOT.jar:2.0-SNAPSHOT] at collins.models.AssetMetaValue.java.util.NoSuchElementException: None.get
**CLICK TO EXPAND**~Interestingly, it doesn't seem to result in an error to the client but also does not correctly write the data to the database.~ (See first comment)
I am running the docker container with (mostly) standard / default configurations under systemd, and have 7 units (#s 10-16) looking like this:
Unfortunately, since it doesn't seem to fail in the response, I'm not sure what step failed.
I was able to reproduce these issues in about 9 out of 10 test runs, and can share with you instructions on how to reproduce what I'm seeing (using tooling which is not public at this time.)