Closed frapit closed 2 years ago
Hmm, I haven't seen this error before... Let's try a few things to see if I can figure out what is causing the error.
It is not really getting to the "new" bag files. The database has a couple of thousand bag files, and usually when we added new bag files to the folder (not via the web interface), the new bags were picked up after some time. Today I noticed that since the last update (10 days ago) no new bags were added, because it stays forever in the "Removing missing bag entries" state. The above error gets thrown for many of the "old" bags (~100).
There are also some other BagReaderExceptions, like:
bagdb_1 | 2021-09-10 13:49:54.685 [pool-2-thread-1] DEBUG c.g.swrirobotics.bags.BagService - Checking /bags/zz/abc.bag...
bagdb_1 | 2021-09-10 13:49:54.687 [pool-2-thread-1] ERROR c.g.swrirobotics.bags.BagService - Error reading bag file.
bagdb_1 | com.github.swrirobotics.bags.reader.exceptions.BagReaderException: Expected to read 13 bytes but only got -1.
or
bagdb_1 | 2021-09-10 13:49:54.656 [pool-2-thread-1] INFO c.g.swrirobotics.bags.BagService - Adding new bag: /bags/verylongbagnameverylongbagnameverylongbagnameverylongbagnameverylongbagnameverylongbagname.bag
bagdb_1 | 2021-09-10 13:49:54.680 [pool-2-thread-1] WARN o.h.e.j.spi.SqlExceptionHelper - SQL Error: 0, SQLState: 22001
bagdb_1 | 2021-09-10 13:49:54.681 [pool-2-thread-1] ERROR o.h.e.j.spi.SqlExceptionHelper - ERROR: value too long for type character varying(100)
bagdb_1 | 2021-09-10 13:49:54.684 [pool-2-thread-1] ERROR c.g.s.b.s.f.FilesystemBagStorageImpl - Unexpected error updating bag file:
I thought the exception would not leave the file watcher in an error state. I can try to remove all the old bags that produce an error, but this will be some work.
It's probably worth cleaning up the other bag files that are causing errors, although I don't think those would cause the file scanner to stop working. Just for reference:
bagdb_1 | com.github.swrirobotics.bags.reader.exceptions.BagReaderException: Expected to read 13 bytes but only got -1.
I've seen this happen if a bag file is empty or if the header is corrupted, so you should check whether the bags that produce this error are actually valid.
For the specific bag file that it thinks is missing:
bagdb_1 | 2021-09-10 13:50:24.568 [pool-2-thread-1] DEBUG c.g.swrirobotics.bags.BagService - Checking /bags/abc/def/2019-11-27-18-49-10.bag...
bagdb_1 | 2021-09-10 13:50:24.599 [pool-2-thread-1] DEBUG c.g.swrirobotics.bags.BagService - Calculated bag md5sum: 0156736fc0eab18169e59d1d640cf403
bagdb_1 | 2021-09-10 13:50:24.994 [pool-2-thread-1] ERROR org.hibernate.AssertionFailure - HHH000099: an assertion failure occurred (this may indicate a bug in Hibernate, but is more likely due to unsafe use of the session): org.hibernate.AssertionFailure: null id in com.github.swrirobotics.persistence.Bag entry (don't flush the Session after an exception occurs)
Does this bag exist inside the bagdb_1 docker container at /bags/abc/def/2019-11-27-18-49-10.bag? You might try using docker exec to get a shell inside the container and verify that it exists and that you can read the file. I'll test it a little bit to see if I can reproduce this error, although a workaround for now might be to just catch that exception so it doesn't stop the file scanner...
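As a minimal sketch of that check (the container name bagdb_1 is taken from the log prefix above, and the bag path from the "Checking" log line; adjust both for your setup):

```shell
# Hypothetical names: container "bagdb_1" and bag path are taken from the logs above.
BAG=/bags/abc/def/2019-11-27-18-49-10.bag

check_in_container() {
  # Verify the file exists and is readable from inside the container.
  docker exec bagdb_1 ls -l "$BAG"
  docker exec bagdb_1 sh -c "test -r '$BAG' && echo readable"
}

# Or open an interactive shell to poke around manually:
#   docker exec -it bagdb_1 /bin/bash
```

If `ls` succeeds but the readability test fails, it points at a permissions problem on the bind mount rather than a bug in the scanner.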
I checked some of the bag files throwing the null id error, and the files exist inside the container. Outside the container I can also run rosbag info on them without problems. For the other errors (Expected to read 13 bytes) the files are indeed corrupt.
Thanks for your help!
EDIT
One more thing: the only error popping up in the UI error log is Unable to read... Check its permissions. Checking these bags on disk shows that they are corrupt as well and need to be reindexed; the docker logs show some chunk errors and a final "ERROR c.g.swrirobotics.bags.BagService - Error reading bag file."
So it seems the null id error is the only one where working bag files are erroneously flagged as defective. I will try to sort the other ones out.
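To triage which bags are actually corrupt, a quick header check can help (a sketch, not part of the Bag Database itself): a valid rosbag v2 file starts with the 13-byte version line "#ROSBAG V2.0" plus a newline, which plausibly is the read behind the "Expected to read 13 bytes but only got -1" error. A damaged but non-empty bag can then be handed to rosbag's own repair tool.

```shell
# Sketch: flag bags whose 13-byte version header is missing or damaged.
# An empty or truncated file would explain "Expected to read 13 bytes but only got -1".
check_bag() {
  local bag="$1"
  if [ "$(head -c 13 "$bag" 2>/dev/null)" = "#ROSBAG V2.0" ]; then
    echo "$bag: header OK"
  else
    echo "$bag: missing or corrupt header"
  fi
}

# Example with a deliberately empty stand-in for a corrupt bag:
: > /tmp/empty.bag
check_bag /tmp/empty.bag

# For a damaged but non-empty bag, rosbag can rebuild the index
# (it keeps a *.orig.bag backup); requires a ROS environment:
#   rosbag reindex /path/to/broken.bag
```

Note that a valid header only rules out the truncation case; a bag can still have corrupt chunks deeper in the file, which `rosbag info` would surface.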
Some updates on this:
So apparently this "null id" error is not necessarily related to specific bagfiles.
I just reverted to an older container version (bagdb 3.2.0) and now it works again and indexes all my files (including all new bags) correctly.
I've tried a little bit, but so far haven't had any luck replicating this locally. Would it be possible for you to share your docker-compose.yml file (with passwords removed, of course) so I can compare it to mine?
Sure, but I think there is almost no difference from the example files you provide in the docs. Here is the version where the indexer stops; when I comment in the other image instead, it works.
version: '3.6'
services:
  docker:
    image: docker:dind
    privileged: yes
    networks:
      - bagdb
    restart: unless-stopped
    volumes:
      - bags:/bags:ro # Needs to match the path in the bagdb container
      - scripts:/scripts
      - docker_cache:/var/lib/docker
    command: ["dockerd", "--host=tcp://0.0.0.0:2375"]
  bagdb:
    #image: swrirobotics/bag-database:latest
    image: ghcr.io/swri-robotics/bag-database/bag-database:latest
    networks:
      - bagdb
      - web
    depends_on:
      - postgres
    restart: unless-stopped
    labels:
      - "traefik.enable=true"
      - "traefik.port=80"
      - "traefik.http.routers.bagdb.entrypoints=websecure"
      - "traefik.http.routers.bagdb.tls=true"
      - "traefik.http.routers.bagdb.rule=PathPrefix(`/bagdb`)"
      - "traefik.http.routers.bagdb.priority=100"
    volumes:
      - bags:/bags # Replace this with the path to your bag files
      - indexes:/root/.ros-bag-database/indexes
      - scripts:/scripts
    environment:
      ADMIN_PASSWORD: "XXXXXXXXXX"
      BAGDB_PATH: "bagdb"
      DB_DRIVER: org.postgresql.Driver
      DB_PASS: XXXXXXXXXX
      DB_URL: "jdbc:postgresql://postgres/bag_database"
      DB_USER: bag_database
      DOCKER_HOST: "http://docker:2375"
      GPS_TOPICS: "/localization/gps, gps, /vehicle/gps/fix, /localization/sensors/gps/novatel/raw, /localization/sensors/gps/novatel/fix, /imu_3dm_node/gps/fix, /local_xy_origin"
      METADATA_TOPICS: "/metadata"
      VEHICLE_NAME_TOPICS: "/vms/vehicle_name, /vehicle_name"
      OPEN_WITH_URLS: "{'Webviz':['https://mss.mydomain.org/webviz/?', 'remote-bag-url']}"
  postgres:
    image: postgis/postgis:11-2.5
    networks:
      - bagdb
    restart: unless-stopped
    volumes:
      - postgres:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    environment:
      POSTGRES_PASSWORD: XXXXXXXXXX
      POSTGRES_USER: bag_database
      POSTGRES_DB: bag_database
  webviz:
    image: cruise/webviz
    networks:
      - bagdb
      - web
    volumes:
      - ./webviz.conf:/etc/nginx/conf.d/default.conf:ro
    labels:
      - "traefik.enable=true"
      - "traefik.port=80"
      - "traefik.http.routers.webviz.rule=PathPrefix(`/webviz`)"
      - "traefik.http.routers.webviz.entrypoints=websecure"
      - "traefik.http.routers.webviz.tls=true"
      - "traefik.http.routers.webviz.priority=100"
      - "traefik.http.services.webviz.loadbalancer.server.port=8080"
volumes:
  bags:
    driver: local
    driver_opts:
      type: 'none'
      o: 'bind'
      device: '/mnt/bags'
  docker_cache:
  postgres:
  ldap:
  slapd:
  indexes:
  scripts:
    driver_opts:
      type: 'tmpfs'
      device: 'tmpfs'
networks:
  web:
    external: true
  bagdb:
    internal: true
I have the same error while trying to seed the database with around 100 bags (v3.4.0, nearly the same docker-compose.yml). It's probable some of them are corrupt, but I think it would be better, if possible, to just identify them for feedback and ignore them, so that the process can finish properly. One other issue that may be linked: bag importation seems to run in a transaction, which means that if something breaks during the process, all the work is lost.
Sorry for taking so long on this, although I think I've got a fix for the issue; check #161 for details. If you'd like to test it, I pushed a docker image to ghcr.io/hatchbed/bag-database:v3.4.1 that has the fix in it.
@frapit if this solves your issue, please let me know and I'll merge the PR and close the issue.
@pjreed Thanks for answering so soon, I pulled your image and restarted the process. I'm going to let it run for the night, I'll tell you tomorrow if it fixes my issues.
Thank you for the quick fixup! I pulled the image and I can at least confirm partial success: no errors are thrown in the logs, and the status line seems to indicate that everything is ready (no more Bag Service: Removing missing bag entries). That indicates that the faulty bags are successfully caught by your routine.
However, I am not entirely sure if the BagController is in the correct state, because when I add a new bag to the mapped folder, it is still not added to the DB. The new folder is immediately detected in the folder view, but not the bag in it. When I do a "Force Bag Scan", nothing happens except the corresponding INFO c.g.s.admin.AdminController - forceScan in the logs. I can work around this by restarting the container every night or so; then all new bags are added. I am not sure if this relates to the original issue in any way; in any case, the linked pull request is definitely an improvement and should be merged, @danthony06.
@frapit If you keep having issues with that BagController, could you open a separate issue?
@frapit So, just to make sure I understand what's going on -- clicking the "Force Bag Scan" does not pick up new bags, but restarting the Bag Database does, and in both cases, it doesn't print any exceptions in the log?
When I restart the container, I get the Is /bags/test/test.bag missing? false prints for all bags. When I do a "Force Bag Scan" instead, the debug logs only show an INFO c.g.s.admin.AdminController - forceScan and nothing else happens. The file watcher does not pick up new bags either.
Aha, I think there's actually a bug in the AdminController (probably caused by the recent data storage refactoring) that is causing it to not properly iterate through all of the storage backends when you try to force an update. I can fix that easily...
That wouldn't explain why the file watcher isn't picking up any new bags, though. What kind of filesystem are your bags stored on?
I went ahead and pushed another image to ghcr.io/hatchbed/bag-database:v3.4.2 that should fix the buttons that force a bag scan, and I opened another PR, #162.
Sorry for the delay. Thank you, the PR fixes the force bag scan functionality. The file watcher issue is on my side, I guess: the bags are mounted via CIFS, for which the Linux kernel apparently has no inotify support. I didn't notice before, as I was restarting every night anyway, so I only tested after the bagdb hadn't picked up changes for many days. Maybe the scan functionality could be exposed via the API so it could be triggered automatically once in a while, but I am quite happy with the current state. Thank you again for the fast response and the great fixes!
Ok, that makes sense. There isn't a way to get filesystem notifications for CIFS mounts, so I would say that's not a bug. It might make sense to add an option to periodically scan for new bags in case it's using a filesystem that doesn't have inotify support. (in fact, the S3 storage backend has to do that)
The methods in the AdminController are actually accessible through the API, although they're not in the documentation... I should probably add documentation for those. To force a scan, you can make a POST request to /admin/forceScan after authenticating with the admin password.
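A rough sketch of such a request (only the /admin/forceScan endpoint is confirmed above; the signin URL and form-field names are assumptions, so check them against your deployment):

```shell
# Base URL from the compose file above: Traefik serves bagdb under /bagdb.
BASE=https://mss.mydomain.org/bagdb

force_scan() {
  # Authenticate as admin first (endpoint and field names here are assumptions),
  # keeping the session cookie, then trigger the scan with the cookie attached.
  curl -s -c /tmp/bagdb-cookies.txt \
       -d 'username=admin' -d "password=$ADMIN_PASSWORD" \
       "$BASE/signin"
  curl -s -b /tmp/bagdb-cookies.txt -X POST "$BASE/admin/forceScan"
}

# Run it (requires ADMIN_PASSWORD in the environment):
#   force_scan
```

Scheduled from cron, a call like this could substitute for the missing inotify notifications on CIFS mounts.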
Perfect, that is good to know. I will go for the API request then. The only thing I was still wondering about is that the bagdb immediately picks up newly created folders when opening the folder view. But I guess you iterate over the folders on the filesystem explicitly every time the view is rendered, while the bags are collected from the internal representation, which explains it. Thank you for your support!
Hi,
I am not sure if this is because I recently updated from 3.2 to the latest version, but the database stopped indexing any new bags from that day on. I restarted the service multiple times and triggered a forced rescan via the admin panel and curl, but it always gets stuck at some point.
The status line always indicates Bag Service: Removing missing bag entries and it seems to get stuck during that operation. For some bags I have errors in the UI: Unable to read /bags/xy/somefile.bag. Check its permissions.
I don't see any permission errors, but I don't think that is the main cause. In the docker compose logs I get multiple exceptions of the following; I think they are stopping the file watcher in some way.
I am not sure what to do with this information of a null id. Is this some database corruption? Should I delete it and start over? Recreating everything takes many hours and I am not sure if it is worth it. Thank you for any hint!