ubc-systopia / Indaleko

Indaleko Project
GNU Affero General Public License v3.0

Testing indexed and ingested data #21

Open hadisinaee opened 7 months ago

hadisinaee commented 7 months ago

Given #20, I believe it is crucial to conduct correctness testing for data produced by our indexer and ingester. We may want to test them separately since they represent two distinct and independent phases of Indaleko.

The tests for the indexer should include the following:

- capturing the correct number of folders and files in the indexed path
- verifying the relationships (e.g., parent–child) between the indexed files and folders

For the ingesters, we can apply the same tests for relationships and for the number of files and folders. However, since we are adding metadata to edges and vertices, we might need additional tests for that metadata. While I don't have a concrete plan at the time of creating this issue, one possible approach is a test script that sets up a test folder structure with random files and folders (it could also be a fixed structure).
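As a concrete sketch of the fixed-structure idea: a test could build a known tree on disk and compare the indexer's file/folder counts against ground truth. Everything here (the layout, the helper names) is hypothetical, just to illustrate the shape of such a test.

```python
import os
import tempfile


def build_fixture(root):
    """Create a small, fixed folder structure under `root`.

    Hypothetical layout:
        root/a/one.txt
        root/a/b/two.txt
        root/c/three.txt
    """
    os.makedirs(os.path.join(root, "a", "b"))
    os.makedirs(os.path.join(root, "c"))
    for rel in ("a/one.txt", "a/b/two.txt", "c/three.txt"):
        with open(os.path.join(root, *rel.split("/")), "w") as f:
            f.write("x")


def count_entries(root):
    """Walk `root` and return (n_folders, n_files), excluding `root` itself."""
    n_dirs = n_files = 0
    for _, dirs, files in os.walk(root):
        n_dirs += len(dirs)
        n_files += len(files)
    return n_dirs, n_files


with tempfile.TemporaryDirectory() as root:
    build_fixture(root)
    # An indexer test would compare the indexer's own counts against
    # these known ground-truth values: 3 folders (a, a/b, c), 3 files.
    assert count_entries(root) == (3, 3)
```

A randomized variant would generate the tree with a seeded RNG so failures stay reproducible.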

fsgeek commented 7 months ago

How do you define "correct" in this case? The data sets we are indexing are, in fact, dynamic in nature, which does complicate things a bit.

For example, you say "capturing the correct number of folders and files in the indexed path," but when I review the logs from my runs, I see files that could not be indexed because the stat operation returned bogus parameters. My analysis is that this was due to the Python implementation masking errors (like "access denied"), so the returned data was bogus. My approach was to log the error and move on. This means those files were not included in the count.

For the relationship, ignoring the dynamic nature of the data, building an ingester that computes checksums would be sufficient. A more general question is "how do we validate this for cloud storage indexing and ingesting?"

We could, for example, add logic in the local ingester to verify that we can open the file using the data from the indexer. Validating the ingester output isn't as clear to me: are we validating the normalization against the output file, or what gets inserted into the database, or both?

Do we use the indexer input file to validate the ingester output file, or do we build a different indexer and have it re-walk the tree?

Finally, is this an important enough problem that we can't take measurements without fixing it?

hadisinaee commented 7 months ago

Sure, all these questions are important and valid, and right now I don't have answers for them; that's why I opened this issue. If we only rely on successful indexing and ingestion into ADB (meaning no critical errors anywhere), we might end up in the wrong place.

For example, I made a mistake in a previous version of my ingester (I messed up finding the relationships), but I could still index and ingest the data. The question of correctness here would be: "Were the relationships what they should be?" The answer is no. So even though the data was ingested without errors, what ended up in the database was wrong. We can all make mistakes somewhere in our code, and as the code base grows, the probability of a bug increases.

So, my main point revolves around this. One possible solution is to run a few tests after ingesting data to make sure that what's in the database is exactly what we wanted, in the right place and order. We can figure out a way to do it. But first, I just want to make sure we are on the same page about why we need it; then we can figure out the what and the how.
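One way to sketch such post-ingest tests without committing to a database API: pass in a counting callback (for ArangoDB this could be an AQL `COUNT` per collection) and compare the database's counts against ground truth from the walked tree. Everything here is illustrative.

```python
def check_ingested_counts(expected_files, expected_dirs, count_docs):
    """Post-ingest consistency check.

    `count_docs(kind)` returns how many documents of that kind are in
    the database; the query layer is deliberately left abstract here.
    Returns a list of human-readable mismatches (empty == pass)."""
    problems = []
    for kind, expected in (("file", expected_files),
                           ("directory", expected_dirs)):
        actual = count_docs(kind)
        if actual != expected:
            problems.append(f"{kind}: expected {expected}, found {actual}")
    return problems
```

The same pattern extends to relationship checks: count the contains/contained-by edges in the database and compare against the parent–child pairs seen during the walk.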

If you think we can postpone this issue or that it isn't necessary, we can move forward. It was just a question that popped into my head.