sepinf-inc / IPED

IPED Digital Forensic Tool. It is open source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in corporate investigations by private examiners.

Just first regex hit is shown if multiple regex patterns match the same input string #1897

Closed lfcnassif closed 11 months ago

lfcnassif commented 11 months ago

Reported on https://github.com/sepinf-inc/IPED/discussions/1745

wladimirleite commented 11 months ago

I can take a look at this, @lfcnassif.

lfcnassif commented 11 months ago

If you have available time @tc-wleite, that would be great! Thank you very much for all your volunteer work on this project!

wladimirleite commented 11 months ago

I was able to reproduce and fix the issue reported by @paulobreim (https://github.com/sepinf-inc/IPED/discussions/1745#discussion-5362046). If a string matches more than one regex, only the first one was considered.
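The difference between "stop at the first matching regex" and "report every matching regex" can be sketched as below. This is only an illustration of the behavior described above, not IPED's actual code: it uses `java.util.regex` instead of dk.brics.automaton, and the pattern names and expressions are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class MultiMatchSketch {
    // Hypothetical named patterns, NOT taken from IPED's RegexConfig.txt.
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("DIGITS", Pattern.compile("[0-9]+"));
        PATTERNS.put("ELEVEN_CHARS", Pattern.compile("[0-9a-z]{11}"));
        PATTERNS.put("LETTERS", Pattern.compile("[a-z]+"));
    }

    // Behavior described in the issue: return after the first pattern that matches.
    static String firstMatch(String s) {
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            if (e.getValue().matcher(s).matches()) {
                return e.getKey();
            }
        }
        return null;
    }

    // Expected behavior: collect every pattern that matches the same string.
    static List<String> allMatches(String s) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            if (e.getValue().matcher(s).matches()) {
                names.add(e.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String input = "12345678901";
        System.out.println(firstMatch(input)); // DIGITS
        System.out.println(allMatches(input)); // [DIGITS, ELEVEN_CHARS]
    }
}
```

With the input `12345678901`, the first variant reports only `DIGITS`, while the second reports both `DIGITS` and `ELEVEN_CHARS`.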

It doesn't seem to be the same situation reported by @milcent-CVM (https://github.com/sepinf-inc/IPED/discussions/1745#discussioncomment-7101416).

@milcent-CVM, can you provide a couple of sample strings that should match the regex you created?

milcent-CVM commented 11 months ago

Sure! Thank you!

Let me just post the regex pattern that got the most hits. It is still evolving on regex101.com: it does not yet account for all Latin characters in a person's name, and it expects case-insensitive matching, which is the default in IPED:

\b(?:\n*)+(?:\s*\-?)?(?:[^\-]relatora?)+(?:\s*:?\s*)(?:diretora?|presidente|)?\s*(\w+é*(?:(?: +\w+)+))|(\w+(?:(?: +\w+é*)+))(?:\r*\t*\n*)(?:diretora?)\-(?:relatora?)\b

I will join many parts of the articles here into one piece of text that should produce many hits, OK?

""" RELATOR : WLADIMIR CASTELO BRANCO CASTRO

DURVAL JOSÉ SOLEDADE SANTOS Diretor-Relator

JOSÉ LUIZ OSORIO DE ALMEIDA FILHO RELATOR : Diretor Durval José Soledade Santos

Dos fatos Lei nº 6.404/1976.

Diretor Relator: Alexandre Costa Rangel

Relatório de Julgamento (1251150) SEI 19957.010729/2019-31 / pg. 1

Data do julgamento: 23/06/2020

Relator: Diretor Henrique Machado

Acusados:

LEONARDO BRUNET MENDES DE MORAES

Diretor-Relator

FRANCISCO AUGUSTO DA COSTA E SILVA

Presidente

RELATÓRIO

Relator: Leonardo Brunet Mendes De Moraes

DOS FATOS

WLADIMIR CASTELO BRANCO CASTRO Diretor-Relator

Presidente da Sessão

RELATÓRIO

Relator : Diretor Wladimir Castelo Branco Castro

Rio de Janeiro, 04 de abril de 2007.

Maria Helena de Santana Diretora-Relatora

Marcelo Fernandez Trindade Presidente da Sessão de Julgamento

Rio de Janeiro, 18 de dezembro de 2007.

Durval Soledade Diretor-Relator

Maria Helena dos Santos Fernandes de Santana Participaram do julgamento os Diretores Marcos Barbosa Pinto, Relator, Durval Soledade, Sergio Weguelin e a Presidente da CVM, Maria Helena dos Santos Fernandes de Santana.

Rio de Janeiro, 28 de agosto de 2007.

Marcos Barbosa Pinto Diretor-Relator

Maria Helena dos Santos Fernandes de Santana Rio de Janeiro, 21 de agosto de 2007.

Eli Loria Diretor-Relator e Presidente da Sessão de Julgamento """

wladimirleite commented 11 months ago

@milcent-CVM, using the regex and the text posted, regex101 is not showing any matches (screenshot below). Can you check if I am doing something different from what you are?

[Screenshot: regex101 reporting no matches for the regex and text above]

milcent-CVM commented 11 months ago

@tc-wleite, this is probably because regex101 is case-sensitive, while IPED is not (at least according to RegexConfig.txt). The regex that works in regex101 is the one below:

\b(?:\n*)+(?:\s*\-?)?(?:[^\-]Relatora?|RELATORA?)+(?:\s*:?\s*)(?:Diretora?|DIRETORA?|Presidente|PRESIDENTE)?\s*(\w+(?:(?: +\w+é*É*)+))|(\w+(?:(?: +\w+é*É*)+))(?:\r*\t*\n*)(?:Diretora?|DIRETORA?)\-(?:Relatora?|RELATORA?)\b

[Screenshot: regex101 results for the regex above]
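The effect of case sensitivity on the first pattern can be reproduced with a small sketch. It uses `java.util.regex` purely as a stand-in for an engine that is case-sensitive unless told otherwise; it says nothing about how IPED or regex101 implement the option, and the class and method names are hypothetical. The sample strings come from the text posted above.

```java
import java.util.regex.Pattern;

class CaseInsensitiveSketch {
    // Returns true if the regex is found anywhere in the input, using the given flags.
    static boolean found(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String line = "RELATOR : WLADIMIR CASTELO BRANCO CASTRO";
        String name = "DURVAL JOS\u00c9 SOLEDADE SANTOS"; // JOSÉ

        // Case-sensitive by default: a lowercase pattern misses uppercase text.
        System.out.println(found("relatora?", line, 0)); // false

        // With CASE_INSENSITIVE, the same pattern matches.
        System.out.println(found("relatora?", line, Pattern.CASE_INSENSITIVE)); // true

        // In java.util.regex, CASE_INSENSITIVE alone only folds ASCII letters,
        // so the accented character still fails to match...
        System.out.println(found("jos\u00e9", name, Pattern.CASE_INSENSITIVE)); // false

        // ...unless UNICODE_CASE is added.
        System.out.println(found("jos\u00e9", name,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)); // true
    }
}
```

This is why a pattern written in lowercase needs either a case-insensitive flag or explicit alternatives such as `Relatora?|RELATORA?` when tested in a case-sensitive tool.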

wladimirleite commented 11 months ago

@milcent-CVM, it seems that the syntax used by regex101 and the syntax used by IPED (which uses dk.brics.automaton) are not the same. Please take a look at:

https://www.brics.dk/automaton/faq.html

https://www.brics.dk/automaton/doc/dk/brics/automaton/RegExp.html

Trying simpler expressions first in IPED should help. Another option is to use a small standalone program to test the dk.brics.automaton library:

import dk.brics.automaton.*;
public class Test {
    public static void main(String[] args) {
        RegExp r = new RegExp("([^\\-]relatora?)");
        Automaton a = r.toAutomaton();
        String input = " relator".toLowerCase();
        System.out.println(a.run(input));
    }
}

This program prints "true". But if I change the expression to "(?:[^\-]relatora?)", which uses the non-capturing group syntax "(?:", it no longer matches. In regex101, however, it does: [Screenshot: regex101 matching the expression]
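One possible way to port a regex101-style pattern is to turn each non-capturing group `(?:...)` into a plain group `(...)` before handing it to an engine that does not accept that syntax. The helper below is a hypothetical sketch, not part of IPED or dk.brics.automaton. It only handles `(?:`, skips escaped characters, and does not attempt to parse character classes or other constructs such as `\b`, `\w` or `\s`, which may also need rewriting. The check in `main` uses `java.util.regex` only to confirm that the rewritten pattern still accepts the same input there.

```java
import java.util.regex.Pattern;

class NonCapturingSketch {
    // Hypothetical helper: rewrites "(?:" to "(" outside of backslash escapes.
    // Limitation: a literal "(?:" inside a character class would also be rewritten.
    static String stripNonCapturing(String regex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Copy the escape and the escaped character unchanged.
                out.append(c).append(regex.charAt(++i));
            } else if (c == '(' && regex.startsWith("?:", i + 1)) {
                // Drop the "?:" marker and keep a plain group.
                out.append('(');
                i += 2;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "(?:[^\\-]relatora?)";
        String rewritten = stripNonCapturing(original);
        System.out.println(rewritten); // ([^\-]relatora?)

        // Both forms accept the same whole string in java.util.regex.
        System.out.println(Pattern.matches(original, " relator"));  // true
        System.out.println(Pattern.matches(rewritten, " relator")); // true
    }
}
```

The rewritten expression is the same one used in the standalone test program above, so it can be checked directly with `Automaton.run`.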

milcent-CVM commented 11 months ago

OK!! Thank you very much!! I will start with simpler patterns and work my way up, and try to find some testing framework more compatible with IPED’s Regex. Cheers!

Marcel Milcent


From: Luis Filipe Nassif
Sent: Tuesday, September 26, 2023 11:25:55 PM
To: sepinf-inc/IPED
Cc: Marcel Tavares Quinteiro Milcent Assis; Mention
Subject: Re: [sepinf-inc/IPED] Custom user regex patterns in RegexConfig.txt could be ignored (Issue #1897)

Closed #1897 (https://github.com/sepinf-inc/IPED/issues/1897) as completed via #1900 (https://github.com/sepinf-inc/IPED/pull/1900).
