Web scraping and filtering code for the Slovak contract database, crz.gov.sk. The code downloads the XML databases, builds a CSV database of contracts, filters them, downloads the contract files, and extracts and cleans up tables with MD rates.
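The XML-to-CSV and filtering steps could look roughly like the sketch below. It is only an illustration: the element names (`contract`, `id`, `subject`, `supplier`, `amount`) and the helper functions are assumptions made for the example and are not taken from the actual crz.gov.sk export schema or from this repository.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical field names; the real crz.gov.sk export schema may differ.
FIELDS = ["id", "subject", "supplier", "amount"]


def contracts_from_xml(xml_text):
    """Yield one dict per <contract> element (element names are assumed)."""
    root = ET.fromstring(xml_text)
    for node in root.iter("contract"):
        yield {name: (node.findtext(name) or "").strip() for name in FIELDS}


def filter_contracts(contracts, keywords):
    """Keep contracts whose subject mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    for contract in contracts:
        subject = contract["subject"].lower()
        if any(k in subject for k in lowered):
            yield contract


def to_csv(contracts):
    """Serialise contracts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(contracts)
    return buffer.getvalue()
```

Chaining the three helpers, e.g. `to_csv(filter_contracts(contracts_from_xml(xml_text), ["sla"]))`, keeps every stage lazy until the CSV is written, which helps when the XML dumps are large.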
The processed files are contracts. Their structure, wording and even individual paragraphs vary significantly, so it would be interesting to use AI with natural-language understanding to extract additional data from the contracts, such as:
Whether there are subcontractors or not (in one case the contract contained all the necessary paragraphs handling possible subcontractors and referenced an annex with the list of subcontractors, but the annex held only a single sentence stating that there would be no subcontractors for that particular project).
Contract type (SLA, works contract, etc.). In some cases, this can only be inferred from the context.
Whether the contract actually contains information about MD rates
Seniority of assigned job positions
SLA parameters, if it is an SLA contract (e.g. service availability, assignment of hotline levels to particular parties, the borderline between maintenance and upgrade, handling of license and HW management, etc.)
Other indicators (depends on what comes out of the discussion)
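One possible way to hold the extracted data is a small record type that a language model (or any other extractor) fills in as JSON. The sketch below is only a suggestion: the class name `ContractFacts`, its field names and the `parse_facts` helper are hypothetical, and the final set of indicators depends on the discussion mentioned above.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ContractFacts:
    # None means "could not be determined from the contract text".
    has_subcontractors: Optional[bool] = None
    contract_type: Optional[str] = None  # e.g. "SLA", "works contract"
    contains_md_rates: Optional[bool] = None
    seniority_levels: List[str] = field(default_factory=list)
    sla_parameters: Dict[str, str] = field(default_factory=dict)


def parse_facts(raw_json):
    """Build ContractFacts from a JSON object, ignoring unknown keys."""
    data = json.loads(raw_json)
    known = {
        name: data[name]
        for name in ContractFacts.__dataclass_fields__
        if name in data
    }
    return ContractFacts(**known)
```

Using `None` rather than `False` for undetermined values keeps cases like the empty subcontractor annex distinguishable from contracts where the question was never answered.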