ncbi / BioRED


convert_pubtator_2_bert.py cannot correctly create the tsv file #5

Closed pyramid20002000 closed 10 months ago

pyramid20002000 commented 10 months ago

Hello,

I have a rather weird problem using this code.

When I use the BioRED.zip file downloaded from: https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/BioRED.zip

everything works fine. (I only use the prediction part; training was not tested.)

The script that I used is run_test_pred.sh.

But when I ran prediction on new data, it failed with the error message below:

Traceback (most recent call last):
  File "src/run_biored_exp.py", line 795, in <module>
    main()
  File "src/run_biored_exp.py", line 779, in main
    test_dataset = processor.get_test_dataset_by_name(data_args.test_file, data_args.test_has_header)
  File "src/run_biored_exp.py", line 257, in get_test_dataset_by_name
    return self._get_dataset(file_name, "test", has_header)
  File "src/run_biored_exp.py", line 143, in _get_dataset
    data_df = pd.read_csv(data_file, sep='\t', header=None, dtype=str, keep_default_na=False)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1679, in _make_engine
    return mapping[engine](f, **self.options)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 557, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
cp: cannot stat 'out_model_biored_novelty/test_results.tsv': No such file or directory
Generating PubTator file
Traceback (most recent call last):
  File "src/utils/run_biored_eval.py", line 910, in <module>
    dump_pred_2_pubtator_file(in_test_pubtator_file = in_test_pubtator_file,
  File "src/utils/run_biored_eval.py", line 186, in dump_pred_2_pubtator_file
    pmid_2_rel_pairs_dict = add_relation_pairs_dict(
  File "src/utils/run_biored_eval.py", line 61, in add_relation_pairs_dict
    testdf = pd.read_csv(in_gold_tsv_file, sep="\t", index_col=0)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1679, in _make_engine
    return mapping[engine](f, **self.options)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 557, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

I tried to track down the problem by comparing the working run and the failing run step by step. I found that "out_processed.tsv" is an empty file (0 KB) in the failing run.

I kept tracking this problem and found that, in order to produce a non-empty tsv file, the PubTator file has to contain relations such as Association.

For example, this PubTator file works:

15485686|t|A novel SCN5A mutation manifests as a malignant form of long QT syndrome with perinatal onset of tachycardia/bradycardia.
15485686|a|OBJECTIVE: Congenital long QT syndrome (LQTS) with in utero onset of the rhythm disturbances is associated with a poor prognosis. In this study we investigated a newborn patient with fetal bradycardia, 2:1 atrioventricular block and ventricular tachycardia soon after birth. METHODS: Mutational analysis and DNA sequencing were conducted in a newborn. The 2:1 atrioventricular block improved to 1:1 conduction only after intravenous lidocaine infusion or a high dose of mexiletine, which also controlled the ventricular tachycardia. RESULTS: A novel, spontaneous LQTS-3 mutation was identified in the transmembrane segment 6 of domain IV of the Na(v)1.5 cardiac sodium channel, with a G-->A substitution at codon 1763, which changed a valine (GTG) to a methionine (ATG). The proband was heterozygous but the mutation was absent in the parents and the sister. Expression of this mutant channel in tsA201 mammalian cells by site-directed mutagenesis revealed a persistent tetrodotoxin-sensitive but lidocaine-resistant current that was associated with a positive shift of the steady-state inactivation curve, steeper activation curve and faster recovery from inactivation. We also found a similar electrophysiological profile for the neighboring V1764M mutant. But, the other neighboring I1762A mutant had no persistent current and was still associated with a positive shift of inactivation. CONCLUSIONS: These findings suggest that the Na(v)1.5/V1763M channel dysfunction and possible neighboring mutants contribute to a persistent inward current due to altered inactivation kinetics and clinically congenital LQTS with perinatal onset of arrhythmias that responded to lidocaine and mexiletine.
15485686	8	13	SCN5A	GeneOrGeneProduct	6331
15485686	56	72	long QT syndrome	DiseaseOrPhenotypicFeature	D008133
15485686	97	108	tachycardia	DiseaseOrPhenotypicFeature	D013610
15485686	109	120	bradycardia	DiseaseOrPhenotypicFeature	D001919
15485686	144	160	long QT syndrome	DiseaseOrPhenotypicFeature	D008133
15485686	162	166	LQTS	DiseaseOrPhenotypicFeature	D008133
15485686	292	299	patient	OrganismTaxon	9606
15485686	311	322	bradycardia	DiseaseOrPhenotypicFeature	D001919
15485686	328	350	atrioventricular block	DiseaseOrPhenotypicFeature	D054537
15485686	355	378	ventricular tachycardia	DiseaseOrPhenotypicFeature	D017180
15485686	482	504	atrioventricular block	DiseaseOrPhenotypicFeature	D054537
15485686	555	564	lidocaine	ChemicalEntity	D008012
15485686	592	602	mexiletine	ChemicalEntity	D008801
15485686	630	653	ventricular tachycardia	DiseaseOrPhenotypicFeature	D017180
15485686	685	689	LQTS	DiseaseOrPhenotypicFeature	D008133
15485686	767	775	Na(v)1.5	GeneOrGeneProduct	6331
15485686	784	790	sodium	ChemicalEntity	D012964
15485686	807	839	G-->A substitution at codon 1763	SequenceVariant	c|SUB|G|CODON1763|A
15485686	857	891	valine (GTG) to a methionine (ATG)	SequenceVariant	p|SUB|V||M
15485686	1018	1024	tsA201	CellLine	CVCL_2737
15485686	1092	1104	tetrodotoxin	ChemicalEntity	D013779
15485686	1119	1128	lidocaine	ChemicalEntity	D008012
15485686	1366	1372	V1764M	SequenceVariant	p|SUB|V|1764|M
15485686	1408	1414	I1762A	SequenceVariant	p|SUB|I|1762|A
15485686	1557	1565	Na(v)1.5	GeneOrGeneProduct	6331
15485686	1566	1572	V1763M	SequenceVariant	p|SUB|V|1763|M
15485686	1731	1735	LQTS	DiseaseOrPhenotypicFeature	D008133
15485686	1760	1771	arrhythmias	DiseaseOrPhenotypicFeature	D001145
15485686	1790	1799	lidocaine	ChemicalEntity	D008012
15485686	1804	1814	mexiletine	ChemicalEntity	D008801
15485686	Association	D001919	6331	Novel
15485686	Positive_Correlation	D001919	p|SUB|V|1763|M	Novel
15485686	Association	D013610	6331	Novel
15485686	Positive_Correlation	D013610	p|SUB|V|1763|M	Novel
15485686	Association	6331	D001145	Novel
15485686	Negative_Correlation	D001145	D008801	No
15485686	Negative_Correlation	D001145	D008012	No
15485686	Positive_Correlation	p|SUB|V|1763|M	D001145	Novel
15485686	Positive_Correlation	p|SUB|V|1763|M	D008133	Novel
15485686	Association	D008133	6331	Novel
15485686	Association	D008133	p|SUB|V||M	Novel
15485686	Association	D008133	c|SUB|G|CODON1763|A	Novel
15485686	Negative_Correlation	D008133	D008801	No
15485686	Negative_Correlation	D008133	D008012	No
15485686	Negative_Correlation	D008801	D017180	No
15485686	Negative_Correlation	D008012	D017180	No
15485686	Negative_Correlation	D054537	D008801	No
15485686	Negative_Correlation	D054537	D008012	No
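
To see which kinds of lines a given PubTator file contains, a rough check like the following can help. It is only a sketch: it assumes the annotation lines are tab-separated (PMID, start, end, mention, type, identifier for entities; PMID, relation type, two identifiers and an optional novelty label for relations), and `summarize_pubtator` and `SAMPLE` are hypothetical names, not part of the BioRED code.

```python
from collections import Counter

def summarize_pubtator(text: str) -> Counter:
    """Count text, entity, and relation lines in PubTator-style text."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines between documents
        # Title and abstract lines look like "PMID|t|..." or "PMID|a|...".
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            counts["text"] += 1
            continue
        fields = line.split("\t")
        # Entity lines (assumed layout): PMID, start, end, mention, type, id.
        if len(fields) >= 6 and fields[1].isdigit() and fields[2].isdigit():
            counts["entity"] += 1
        # Relation lines (assumed layout): PMID, relation type, id1, id2, ...
        elif len(fields) >= 4:
            counts["relation"] += 1
        else:
            counts["other"] += 1
    return counts

# Tiny made-up sample in the assumed layout, for illustration only.
SAMPLE = (
    "15485686|t|A novel SCN5A mutation\n"
    "15485686|a|Abstract text.\n"
    "15485686\t8\t13\tSCN5A\tGeneOrGeneProduct\t6331\n"
    "15485686\tAssociation\tD008133\t6331\tNovel\n"
)

if __name__ == "__main__":
    print(dict(summarize_pubtator(SAMPLE)))
```

Running the same function on the file that fails and on the file that works shows quickly whether one of them has no relation lines at all.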

Can anybody help me figure out what is going wrong in my experiments?

Many thanks.

ptlai commented 10 months ago

We have resolved the problem through email communication. @pyramid20002000

pyramid20002000 commented 10 months ago

Thanks to Dr. Lai's help.

To help more people, I will explain the problem and post the solution here: convert_pubtator_2_bert.py requires a PubTator file to generate the tsv file, which is then used as input for this model. However, the standard PubTator format is slightly different from the format required by BioRED, mainly because the entity types of BioRED and PubTator are different.

Dr. Lai provided a script to convert a PubTator file to a BioRED PubTator file. The script is attached.

convert_pubtator_2_biored.zip
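
For readers who cannot open the attachment, the idea of renaming entity types can be sketched roughly as below. This is not Dr. Lai's script. The BioRED type names on the right are the ones that appear in the working example earlier in this thread; the PubTator type names on the left are my assumption and should be checked against your own file. The sketch also assumes tab-separated entity lines, and `convert_entity_types` is a hypothetical helper. The real script may handle more than this, for example identifiers.

```python
# Assumed mapping from PubTator entity types (left, an assumption) to the
# BioRED entity types seen in the working example (right).
PUBTATOR_TO_BIORED = {
    "Gene": "GeneOrGeneProduct",
    "Disease": "DiseaseOrPhenotypicFeature",
    "Chemical": "ChemicalEntity",
    "Species": "OrganismTaxon",
    "Mutation": "SequenceVariant",
    "CellLine": "CellLine",
}

def convert_entity_types(text: str, mapping=PUBTATOR_TO_BIORED) -> str:
    """Rename the type column of entity lines; leave all other lines as-is."""
    out = []
    for line in text.splitlines():
        fields = line.split("\t")
        # Entity lines (assumed layout): PMID, start, end, mention, type, id.
        if len(fields) >= 6 and fields[1].isdigit() and fields[2].isdigit():
            # Unknown types are kept unchanged rather than guessed.
            fields[4] = mapping.get(fields[4], fields[4])
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out)

if __name__ == "__main__":
    example = "15485686|t|A novel SCN5A mutation\n15485686\t8\t13\tSCN5A\tGene\t6331"
    print(convert_entity_types(example))
```

Types that are not in the mapping are passed through untouched, so it is easy to spot which ones still need a decision.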

Darrshan-Sankar commented 1 month ago

@pyramid20002000 Thanks for sharing this resource. How can I convert the PubTator file generated from AIONER results to the format that BioRED uses, or to the format that the PubTator API returns? I would appreciate any ideas on that.