mar-file-system / GUFI

Grand Unified File-Index

Longitudinal Study: Phase 2 #150

Open johnbent opened 10 months ago

johnbent commented 10 months ago

Once we have #149 completed, we'll need to develop a process by which we capture a longitudinal snapshot from the GUFI tree. There is an initial snapshot process and potentially an incremental snapshot process. For example, here is one proposal:

  1. Create an initial longitudinal snapshot by rolling up all the vrsummary tables into a single vrsummary table that includes a depth column and a pinode column.
     1.1. However, maybe this is not part of the longitudinal snapshot but is part of the GUFI tree build.
  2. Create a copy of this single augmented rolled-up summary table (SARUST).
  3. At some future time, create a new copy of the SARUST.
  4. Potentially reduce storage costs by reducing one of the copies into an increment off the other.
     4.1. This should yield large savings, since there may be 1B entries in the table (1B directories in a GUFI tree) but only a very small number of them will change between longitudinal snapshots.

So, in this issue, we can define how we capture the initial snapshot as well as how we create the increment.
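For concreteness, here is a minimal sketch of step 1, assuming each directory's vrsummary row is copied into one flat output database during a tree walk. The snapshot table name, the depth/pinode additions, and the column list are illustrative rather than the actual GUFI schema:

```sql
-- Illustrative sketch only: one flat table holding every directory's
-- summary row plus the proposed depth and pinode columns.
CREATE TABLE IF NOT EXISTS snapshot (
    inode    INTEGER,
    pinode   INTEGER,   -- parent directory inode (proposed addition)
    depth    INTEGER,   -- depth in the tree (proposed addition)
    name     TEXT,
    totfiles INTEGER,
    totsize  INTEGER,
    mtime    INTEGER,
    atime    INTEGER,
    ctime    INTEGER
);

-- Run against each directory database during the walk (e.g. via gufi_query
-- with an output database), appending that directory's vrsummary row to the
-- flat table; the column names here are assumptions.
INSERT INTO snapshot
SELECT inode, pinode, depth, name, totfiles, totsize, mtime, atime, ctime
FROM vrsummary;
```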

johnbent commented 10 months ago

Gary says,

"I have been thinking about how to speed up the yesterday / today compare

One way would be a brute-force 100M-row join on the entire record. Sounds expensive.

I am thinking we split it up via the inode of the directory (just for determining the differences). Do one pass (very fast) to get the highest and lowest inode. Split the space to parallelize on inode, maybe into 100 parts or something, so it's like 1M rows per chunk.
Then, still against the big flat tables from yesterday and today, do joins at the chunk level, with separate threads, determine the diffs, and then concatenate the diffs.

It could be as simple as: dump table yesterday, dump table today, sort each with Unix sort, break them up by inode, then do a sort-merge for each chunk producing output.

Or you could use SQL to do it as well, using a similar technique with output dbs or an interesting WITH RECURSIVE on inode ranges.

Just worried a bit about a 100M to 101M row join of a very long concatenated key.

We could do the join on just the fsid.inode but that depends on the fs not reusing inodes for a long time and that sounds fragile."
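For illustration, here is a rough SQLite sketch of that chunked compare, assuming yesterday's and today's data have already been flattened into single snapshot tables. All table and column names below are assumptions:

```sql
-- Illustrative sketch only; table and column names are assumptions.
ATTACH 'yesterday.db' AS y;
ATTACH 'today.db'     AS t;

-- One fast pass to find the inode range so it can be split into ~100 chunks:
SELECT MIN(inode), MAX(inode) FROM t.snapshot;

-- Each worker then handles one [:lo, :hi) inode chunk, joining the two flat
-- tables and keeping only rows that are new or changed:
SELECT today.*
FROM   t.snapshot AS today
LEFT JOIN y.snapshot AS yest
       ON  yest.inode = today.inode
       AND yest.name  = today.name
WHERE  today.inode >= :lo AND today.inode < :hi
  AND ( yest.inode IS NULL            -- directory did not exist yesterday
     OR yest.mtime != today.mtime
     OR yest.ctime != today.ctime );
```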

johnbent commented 10 months ago

This might depend a bit on how we capture the longitudinal snapshot. The GUFI summary tables will contain histograms about the files in each directory as well as info about the directories themselves. To capture a longitudinal snapshot, couldn't we simply select the subset of directories with something like:

select * from rolledup_vrsummary_table where atime > last_snapshot_ts [AND ctime and mtime] or entries_max_atime > last_snapshot [AND ctime and mtime]
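Read as "any of the directory's own times, or the max times of its entries, are newer than the previous snapshot", that filter might expand along these lines; the entries_max_* columns and the exact AND/OR combination are assumptions rather than the real schema:

```sql
-- Illustrative expansion of the filter above; column names are assumed.
SELECT *
FROM   rolledup_vrsummary_table
WHERE  MAX(atime, mtime, ctime) > :last_snapshot_ts
   OR  MAX(entries_max_atime, entries_max_mtime, entries_max_ctime) > :last_snapshot_ts;
```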

ndjones commented 2 months ago

Apologies @johnbent if this isn't the right place; happy to be redirected. I've come across your threads on the longitudinal study and on rolling up GUFI indexes to achieve a snapshot, and I've been pondering a similar idea. GUFI as it stands is powerful and impressive, and that power is of interest to a wider audience than many of the storage-analysis tools available to our system administrators. One thing that limits wider access is the limited means of using SQL to work with GUFI: passing SQL queries into GUFI via a command parameter restricts exploratory analysis of GUFI data, in the sense of having only basic tooling available to explore datasets and develop queries of interest.

What I have in mind is a post-processing step that combines all index files into one database, which is then a first-class database available for integration with other toolsets. For example, one could add Redash (https://redash.io/) or any number of tools to such a database, build reports, charts, and dashboards, and provide a much higher level of access and more potential to extract value from the insights GUFI makes possible.

Now, I'm not sure this is quite what you've got in mind, and ultimately it doesn't resolve the challenges of how you combine the data into a snapshot and allow multiple snapshots to be maintained over time into a historical longitudinal record. It perhaps suggests that having the raw data consolidated into one single database could be of higher value and less initial work, since it would only need post-processing into an appropriate data structure that maintains the temporal dimension and resolves the folder hierarchy while keeping all other indexed attributes. Again, I'm probably well off track, so please gently steer me away if so :o).

garygrider commented 2 months ago

One of the hallmarks of GUFI is that it is not a monolithic database: in order to allow users to utilize the database, you would have to make it obey POSIX rules, not just simple access to file names/stat info/etc. but all the inheritance parts of POSIX, where you can change a bit somewhere high in the tree and access to info below there is disallowed for some users, etc. We do combine info into rolled-up databases, but only if all users have the same access to that rolled-up database.

We do accommodate writing the output of queries (even very large queries) into a single output db, or could do CSV or Parquet or whatever pretty trivially if needed. So, as a normal user, you could query all the files/dirs you have access to, including all the system info, user-supplied databases, DSI-provided databases, etc. that we have attached (securely) for you. You could do an extremely complex thing, like all the files/dirs you have access to, with any WHERE clause on the system info, joined with a WHERE clause from some external database you have asked us to attach on a per-directory basis, and other databases someone has provided to attach, all joined on file or dir name or whatever, with GROUP BY/ORDER BY/SUM/AVG... (essentially you can write all the SQL your heart desires). You can create the schema of the output database as you wish and populate that output database basically by using INSERT INTO. It's incredibly powerful, and it only lets you see what you can see as that user. Once you have that output db, you could pretty trivially convert the format or use whatever your favorite tool is.
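As a loose illustration of that pattern (the per-directory view name and columns below are assumptions, and the exact gufi_query options for requesting an output database are omitted):

```sql
-- Illustrative only: a user-defined output schema, populated during the
-- traversal from whatever the querying user is allowed to see.
CREATE TABLE IF NOT EXISTS my_report (name TEXT, size INTEGER, mtime INTEGER, uid INTEGER);

INSERT INTO my_report
SELECT name, size, mtime, uid
FROM   vrpentries                      -- per-directory entries view (name assumed)
WHERE  size > 1024 * 1024 * 1024;      -- e.g. files larger than 1 GiB
```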

We don’t have a tool to save a query, but of course since it’s just text, that’s pretty trivial. We could add a stored-query feature if that would be helpful, but you can do it yourself however you want.

We didn’t want to get into the business of combining info beyond what a given user can see at the time of the query. After that it’s on the user to manage sharing that info in any way they choose.

Now for system/storage admin users that have access to the entire tree, one could do the same thing but the output db could be huge, depending on the query you run of course.

But your question seems to be about the longitudinal info, which is studying trends of the tree(s) over time. We don’t think we can afford to store the entire tree info daily.

We do keep a few days of the trees for “safe keeping”, which I am sure we could make available to sys/storage-admin type people.

The idea for the longitudinal was to figure out what information could be “summarized” on, say, a daily basis and kept somewhere, I assume in a single database, as there is no security reason to keep the info in separate databases. We would add a “date snapped” type of field. I think the result was to more or less take the directory summary records (each directory has a record that summarizes everything about that directory: number of files, histograms of file sizes, ages, modes, …). We will probably toss the file records and maybe add even more info to the directory summaries. There is also a tree summary record that summarizes everything in the tree below, but we could recreate that from the dir summaries, I suppose.
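A minimal sketch of that “date snapped” idea, assuming each day's directory-summary rows are simply appended to one growing history table (all names illustrative):

```sql
-- Illustrative only: one growing history table of daily directory summaries.
CREATE TABLE IF NOT EXISTS longitudinal (
    snapped  TEXT,      -- date this snapshot was taken
    inode    INTEGER,
    pinode   INTEGER,
    depth    INTEGER,
    totfiles INTEGER,
    totsize  INTEGER
);

INSERT INTO longitudinal
SELECT DATE('now'), inode, pinode, depth, totfiles, totsize
FROM   todays_rolled_up_summary;   -- the day's flat summary table (assumed name)
```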

I don’t know what all we decided to put into the longitudinal records, other than that we won’t keep the file records.

I don’t know what the format of the resulting single database will be, probably SQLite, but it would be trivial to just run a scan-type query (maybe by date range of interest) and get an output database in whatever common format, like CSV or Parquet. Parquet might be the most interesting because then you can use any of the Apache analytics tools you desire to do analysis, keep your own data lake of these records, whatever. Maybe John should provide the schema for the longitudinal db and you could suggest other things we may have forgotten.

I think this will get you what you are talking about. The analytics community seems to be filled with snowflakes, so we wanted to just make a way for people to get the data in some reasonable forms and then they can do what they want with it. Storage admins would care about tree growth (bushy, grassy, patterns of growth and shrink, …); security folks might care about sharing patterns (mode bits/changes), etc.

It won’t have accesses. It might have access-time histograms if the file system supports atime without a denial of service.

System admins, well, I don’t know what they would want, maybe whether some file had changed, which we won’t keep for very long; but like I said, we do keep some number of days, and maybe some scheme like the last 5 days plus a single copy at 10 and 30 days, something that is manageable. Of course sysadmins mostly care about file systems that aren’t in GUFI, I suppose. One could use GUFI to track changes to root file systems and stuff like that; the tools are there to do it easily enough using whatever analytics tools you wished.

But we can create a single output db (combining all the thread outputs), I think. Maybe Jason can provide the syntax for that, which could be used to load into any db tool you want. See the aggregate function for creating an aggregate database.

Hope this helps a bit.

calccrypto commented 2 months ago

contrib/longitudinal_snapshot.py is an early draft of a script that does this. It first aggregates the index (the portions the caller can access) into a single flat SQLite3 database file for computing statistics, and then generates different views of the data depending on the first positional input argument.

The schema and example output can be found in test/regression/longitudinal_snapshot.expected.
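Once everything is in one SQLite file, exploratory queries of the kind discussed above become straightforward; for instance (the table and columns here are illustrative and not necessarily the script's actual schema):

```sql
-- Illustrative only: total bytes per top-level directory in each snapshot.
SELECT snapshot_ts, pinode, SUM(totsize) AS bytes
FROM   snapshot
WHERE  depth = 1
GROUP BY snapshot_ts, pinode
ORDER BY snapshot_ts;
```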

ndjones commented 2 months ago

Thanks @garygrider @calccrypto, and apologies, I should have raised a separate issue to discuss this.

Agree entirely on the benefits of GUFI as it operates today, not least the way any user can only gain access to their own information.

It is the platform provider/operator/administrator user I'm thinking about. Specifically, we've built our own limited solution to manage data lifecycles, and recently we've been running a proof of concept on alternate off-the-shelf tools prior to learning about GUFI. From what we've seen in other tools, the longitudinal view is powerful for a provider/operator like an HPC facility or similar. If there is interest on your side, perhaps we could arrange a call to discuss? I'm on nick.jones@nesi.org.nz if we want to take that offline.

I'll take a closer look at the snapshot code and would be happy to contribute thoughts back on topic here too. We're currently deploying a limited POC of GUFI, so we could use this to see how well it suits our needs and share our thoughts.