Closed filipmaljkovic closed 4 months ago
since "igraph" package cannot be installed without GCC 10+
This does not sound right. The igraph C library should work with GCC 4.8. The igraph R package might have accidentally acquired a requirement for a somewhat more recent GCC, but GCC 10 does not sound plausible. If you can't compile igraph with an old GCC that is supported with your R version, I suggest you open an issue.
Yeah, I agree igraph shouldn't require higher GCC version, but with version 4.8 I got the following error:
vendor/cigraph/src/community/infomap/infomap.cc:47:12: error: ‘isnan’ is already declared in this scope
and all the searches on the internet pointed me toward upgrading GCC to 10+, which actually did the trick.
So, whereas igraph doesn't officially require GCC 10+, for me it effectively did require it and the upgrade worked.
But I solved that on my own. The query issue I cannot solve, since it's a bit lower down the stack. However, @milou-brand said on the forum that the cohorts and code are being updated, so I shouldn't run the current version.
Hi @filipmaljkovic
I had a look into the CohortGeneratorModule and it looks like there have been some updates to the package. One of which might be similar to your problem, see here: https://github.com/OHDSI/CohortGeneratorModule/blob/main/NEWS.md#cohortgeneratormodule-041
I will make sure to integrate these into the design when I push it later today, let me know if it solves your problem.
Hi. I downloaded the new code, and indeed, a new folder gets created (CohortGeneratorModule_0.4.2). However, the same query gets stuck at the same place: 5%, and it won't budge. I killed it after 5.5 hours of execution, but I've rerun it now, just to be sure.
@filipmaljkovic thanks for sharing.
Do you have a writeable schema that you can use to store the temporary tables that are created when running the SQL? Have you tried extracting the SQL from the inst folder to see if you can run it in your local PostgreSQL environment? Or extracting the Json file from the inst folder and running it on your local ATLAS instance, if you have one?
I'm trying to figure out where it's hanging, as it doesn't look like the update did the trick.
Yes, I created temp.Codesets and started to analyze the query parts. I attached my analysis so far. Most of the simpler subqueries are fast, but the slowness starts to show once we do group by ... having. The last subquery returned results in 55s (and that's just the inner-joined query).
PhenoPhebStemiCharacterization-sql_analysis.txt
Unfortunately, I don't see any ways to make this faster. Any suggestions?
@filipmaljkovic I have received the same feedback from another data partner. It seems that it is a PostgreSQL issue; I will check with the Strategus development team and see if I can find out more.
Hi @milou-brand and team - was looking into this study design a bit and saw that the cohort definition is using an initial restriction that may be causing some of the performance issues with this cohort. For reference:
The "restrict initial events to" requires exactly 0 occurrences of a non-STEMI condition, which is what causes the LEFT JOIN criteria in the cohort definition query. I suspect this is where you are facing the performance issues. Instead, you might consider revising this cohort definition a bit, moving the "exactly 0 occurrences of a non-STEMI condition" criterion to the inclusion rules section. This change will then perform a LEFT JOIN on the qualified events (STEMI condition, 1st time in history on or after 1 Jan 2016), which will be significantly smaller than the entire condition_occurrence table that is part of the nested subqueries in the original definition.
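To make the shape of the change concrete, here is a rough schematic of the revised join (table names like `qualified_events` and `non_stemi_events` are illustrative placeholders only, not the actual SQL that the cohort generator produces):

```sql
-- Schematic only: the "exactly 0 occurrences" check as an inclusion
-- rule, joining against the already-qualified events instead of the
-- full condition_occurrence table.
SELECT qe.person_id
FROM qualified_events qe            -- STEMI, 1st in history, >= 2016-01-01
LEFT JOIN non_stemi_events ne
  ON ne.person_id = qe.person_id
GROUP BY qe.person_id
HAVING COUNT(ne.person_id) = 0;     -- "exactly 0 occurrences"
```

Because `COUNT(ne.person_id)` ignores NULLs from unmatched rows, the HAVING clause keeps exactly the people with no non-STEMI events, while the LEFT JOIN now scans a much smaller input.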
Tagging @chrisknoll to double check my claims.
@filipmaljkovic I have pushed the update as suggested by @anthonysena above to Git. Let me know if this makes the query run more smoothly or not.
From an index perspective, the index that includes dates is helpful (in that it can range search on dates) but the primary column in the index should be on person_id, since we'll be joining across tables on person_id. Additionally, we will search by concept_id, so a secondary index on {domain}_concept_id also makes sense. In this index I don't think including dates matter because usually when we're searching by concept_id, it doesn't really matter when, it's just 'who has condition/drug/procedure X/Y/Z'. Once we know who they are, we will search for additional information about the person via join on person_id.
I'm not an expert on Postgres topology and indexes, but being able to index on people (to find the people in question) and then supporting a range scan on the rows via date are the two things that should be leveraged to make these queries faster.
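As an illustration of the advice above (index names are made up, and the columns assume a standard OMOP CDM condition_occurrence table), this might look like:

```sql
-- Person_id first so joins across tables can use the index,
-- with the date second to support range scans within a person.
CREATE INDEX idx_co_person_date
    ON condition_occurrence (person_id, condition_start_date);

-- Separate index for "who has condition X" concept lookups;
-- dates are usually irrelevant for this access pattern.
CREATE INDEX idx_co_concept
    ON condition_occurrence (condition_concept_id);
```

The same pattern would apply to observation_period and the other domain tables mentioned in this thread.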
Secondly, the default options Postgres is usually configured with are tuned for smaller-scale queries and fast lookups: it allocates less RAM per query so it can support more connections (like a transactional system). But when dealing with patient records, a query may need to seek through a much larger set of data, and therefore may need more memory to perform joins in memory rather than swapping to disk. I don't have a specific list of parameters for you to look into, but you should do some research on how your environment is configured to ensure it matches this kind of workload.
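For what it's worth, these are the kinds of knobs usually involved; the names are real PostgreSQL settings, but the values below are purely illustrative and would need tuning to the actual machine:

```sql
-- Session-level settings for one large analytic run (values are
-- examples only; size them against the server's actual RAM).
SET work_mem = '256MB';            -- per-sort/per-hash memory for joins
SET maintenance_work_mem = '1GB';  -- speeds up CREATE INDEX

-- shared_buffers can only be changed in postgresql.conf (or via
-- ALTER SYSTEM) and requires a server restart, e.g.:
--   shared_buffers = 8GB   -- on a dedicated analytics server
```

Raising work_mem for the session running the cohort SQL is often enough to move a hash join from disk spill to in-memory execution.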
Hi everyone, thanks for trying to find ways to help out.
This update was actually fruitful: the first query finished almost immediately and the progress bar moved instantly from 5% to 13%, where it settled for a minute before moving on to 16%.
I'm letting it do its work and will update if there are any more problems, or, hopefully, if there's success. :)
We have success:
● completed target CohortDiagnosticsModule_2 [13.88 minutes]
▶ ended pipeline [18.464 minutes]
OK, now I have a different problem, unrelated to DB optimizations, but I thought I might report it here rather than open a new issue.
Namely, in the Results_compile.R, I get an error on line 31:
> RSQLite::dbConnect(RSQLite::SQLite(), file.path(outputLocation, "app" ,paste0(connectionDetailsReference, ".sqlite")))
Error: Could not connect to database:
unable to open database file
The path where it's trying to find the sqlite db file is .../Strategus/outputCHCZ/app/CHCZ.sqlite, but, to no surprise, it doesn't exist. This is the outputCHCZ folder after running Strategus' execute:
So, how do I get the app folder?
You can create the app folder yourself or you can paste this script to create the folder automatically.
subDir <- "app"
if (dir.exists(file.path(outputFolder, subDir))) {
  print("no action taken")
} else {
  dir.create(file.path(outputFolder, subDir))
  print("created app folder in outputFolder")
}
I'd like to know anything you did to improve the DB performance. Performance is something that is under-documented and any information that improves quality-of-life when running this would be greatly appreciated.
@milou-brand thanks! I thought that the sqlite database should have been created during the "CodeToRun.R" execution -- if I had known I only needed to create an empty folder, I would simply have done that :)
That said, I guess that it would be nice for that piece of code to be in "Results_compile.R", since it makes sense to execute it automatically.
However, execution of Results_compile.R only works up to a point. I get the following error:
Uploading file: cd_concept_sets.csv to table: cd_concept_sets
- Preparing to upload rows 1 through 5
An error report has been created at .../Strategus/outputCHCZ/upload-errorReport.R
Error in formatChunk(pkValuesInDb = env$primaryKeyValuesInDb, chunk = chunk) :
concept_set_id is of type numeric which cannot be converted between data frames pkValuesInDb and chunk
Here's the upload-errorReport.R
@chrisknoll I didn't do any further optimization other than the ones mentioned in the OP (way before this change in code): start and end dates for observation_period and condition_occurrence (I believe they aren't in there by default when you run the code to initialize the empty CDM DB).
@filipmaljkovic It looks like there is a discrepancy between how your tables have been set up and what CohortDiagnostics is expecting as input.
If I dive into the CohortDiagnostics package installed locally for me, and then into sql > sql_server > createResultsDataModel.sql, it expects concept_set_id to be an INT rather than numeric in several places, including in the block of code where it creates the cd_concept_sets table, here:
CREATE TABLE @results_schema.@concept_sets (
cohort_id BIGINT NOT NULL,
concept_set_id INT NOT NULL,   -- <- expects INT here
concept_set_sql VARCHAR NOT NULL,
concept_set_name VARCHAR(255) NOT NULL,
concept_set_expression VARCHAR NOT NULL,
PRIMARY KEY(cohort_id, concept_set_id)
);
Perhaps you could try to change concept_set_id to numeric throughout the sql, so it doesn't error?
Error in formatChunk(pkValuesInDb = env$primaryKeyValuesInDb, chunk = chunk) : concept_set_id is of type numeric which cannot be converted between data frames pkValuesInDb and chunk
Actually, I don't think this is at the DB level, it's an R issue. I believe this was a bugfix in ResultModelManager: https://github.com/OHDSI/ResultModelManager/blob/main/NEWS.md
ResultModelManager 0.5.7
Bug fixes:
Added type conversion checking on upload of data where columns are characters but interpreted as numeric from reading inserted data
Check your version of ResultModelManager that is installed, and try to update to at least 0.5.7.
@milou-brand
I might be daft, but all the tables with the stemi_v2 prefix are the following:
cdm=# \dt results.*
...
results | stemi_v2 | table | mega
results | stemi_v2_censor_stats | table | mega
results | stemi_v2_inclusion | table | mega
results | stemi_v2_inclusion_result | table | mega
results | stemi_v2_inclusion_stats | table | mega
results | stemi_v2_summary_stats | table | mega
...
I thought that the concept sets table was temporary. I don't have any concept set table in the results schema.
@chrisknoll
> packageVersion("ResultModelManager")
[1] ‘0.5.7’
Seems I have the proper version (which is no surprise, since I installed everything from this package quite recently, and 0.5.7 was released in early May).
@filipmaljkovic just checking: results_compile will compile the results from the results_folder in the outputCHCZ folder that you shared in a screenshot above. There should be three subfolders (DatabaseMetaData, CohortGeneratorModule_1, CohortDiagnosticsModule_2). Can you confirm there are csv files in these folders? This is essentially all the data output; results_compile just wraps everything into one sqlite file.
@milou-brand Yes:
Everything in the Strategus execution seems to have completed successfully. app/CHCZ.sqlite is even 464KB large.
That does sound weird, especially since you already have the newest ResultModelManager installed. Essentially those csv files are all the output I need for the study. I can try to see if I can run results_compile from my side? Otherwise happy to jump on a quick call if that's easier. My email is milou.brand@iqvia.com
@milou-brand and I had a call and we tried many things (downgrade / upgrade of ResultModelManager, changing the cd_concept_sets.csv in various ways) and nothing worked, until we changed the CohortDiagnostics library code. Namely, we changed "CohortDiagnostics/sql/sql_server/CreateResultsDataModel.sql" and replaced INT with VARCHAR for:
Only then was I able to successfully run the Results_compile.R.
I managed to run the app.R (although the current version of the code accesses the sqlite db file outside the app folder, so the file shouldn't actually be moved there; or, if you do move the file to the app folder, then line 18 should read sqliteDbPath <- file.path(paste0(connectionDetailsReference, ".sqlite"))).
The app opens, but I get the error in question when trying to do cohort diagnostics. The CLI log shows only: Failed to load module CohortDiagnostics
Seems like now it's struggling with our VARCHAR input: it fails on timeId, which, if my memory is correct, is one of the columns we changed.
I did some more searching in the sqlite sql. Could you try changing the concept_sets sql to cast the _cohortid (i.e. lines 36/37) as int? I'm not sure if that will work, but it's worth a try.
Yes, you're right. After changing the VARCHAR columns to INTEGER (using a SQLite IDE), CohortDiagnostics works.
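For anyone else hitting this without an IDE handy: SQLite has no ALTER COLUMN, so the type change has to be done by rebuilding the table. A sketch, reusing the cd_concept_sets column list quoted earlier in this thread (abbreviated; verify against your actual schema before running):

```sql
-- SQLite cannot alter a column's type in place; recreate and copy.
BEGIN TRANSACTION;
CREATE TABLE cd_concept_sets_new (
    cohort_id BIGINT NOT NULL,
    concept_set_id INT NOT NULL,       -- was VARCHAR after the workaround
    concept_set_sql VARCHAR NOT NULL,
    concept_set_name VARCHAR(255) NOT NULL,
    concept_set_expression VARCHAR NOT NULL,
    PRIMARY KEY(cohort_id, concept_set_id)
);
INSERT INTO cd_concept_sets_new
    SELECT cohort_id, CAST(concept_set_id AS INT),
           concept_set_sql, concept_set_name, concept_set_expression
    FROM cd_concept_sets;
DROP TABLE cd_concept_sets;
ALTER TABLE cd_concept_sets_new RENAME TO cd_concept_sets;
COMMIT;
```

The same rebuild pattern applies to any other table (e.g. the one holding timeId) whose column type needs fixing.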
Since all the problems have been solved, I'm closing the issue.
After some fiddling about (also, for other people reading this, note that there's a hard lower limit on the R version, since Strategus requires R >= 4.2.0, and, on Linux, on the GCC version, since the "igraph" package cannot be installed without GCC 10+), I got the Strategus code running. Then, after creating the initial tables, one of the first queries starts executing and never seems to end.
Here's my R console log:
I managed to catch the query (and re-format it), which is as follows:
However, since Codesets seems to be a temp table, I can't really run this query or analyze it properly.
After 17 hours I killed the query, added some indexes (start and end dates for observation_period and condition_occurrence weren't indexed, so I added them, since those columns are used heavily in the query), and reran the query; it has now been running for more than 4 hours.
Is there anything (else) that can be done to make this query finish in finite time?
Note that the server I'm running this on is pretty idle and has 18 CPU cores, yet only one is being used to execute the query (I hear Postgres v9.6 onward supports parallel query execution across multiple cores, and I'm using Postgres 10).