stevencox / chemotext


Word Distances Seem High #7

Open stevencox opened 7 years ago

stevencox commented 7 years ago

Are we getting sentence distances wrong?

- I've seen several lines where the word distance seems really high for the sentence distance:
['21509038', '1303344000', '20-04-2011', '"cisplatin"', '"tp53"', '0', '1', '223', 'true', '1303344000\n']
['21624110', '1306713600', '29-05-2011', '"cisplatin"', '"tp53"', '0', '0', '87', 'true', '1306713600\n']
['21624110', '1306713600', '29-05-2011', '"doxorubicin"', '"tp53"', '0', '0', '103', 'true', '1306713600\n']
['23360326', '1359417600', '28-01-2013', '"tyrosine"', '"p53"', '0', '0', '408', 'false', '1359417600\n']
stevencox commented 7 years ago

@cpschmitt,

  1. The last column before the true/false flag is word distance, not sentence distance. The column header is:
     #pubmed_id pubmed_date_unix_epoch_time pubmed_date_human_readable binary_a_term binary_b_term paragraph_distance sentence_distance word_distance flag_if_valid time_until_verified
  2. Looking at this article (the one in the first row), the distances seem reasonable:
     /projects/stars/var/chemotext/output.mesh.full.2016-07-09/Cell_Death_Dis_2011_Apr_21_2\(4\)_e148.fxml.json
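To make the column mapping easier to check by eye, here is a minimal sketch that labels one of the printed rows with the header names quoted above. The function name `label_row` is hypothetical and not part of chemotext; it only assumes the rows are Python list literals, as pasted in this issue.

```python
import ast

# Column names copied from the header line quoted above (leading '#' dropped).
COLUMNS = [
    "pubmed_id", "pubmed_date_unix_epoch_time", "pubmed_date_human_readable",
    "binary_a_term", "binary_b_term", "paragraph_distance",
    "sentence_distance", "word_distance", "flag_if_valid", "time_until_verified",
]
DISTANCE_COLUMNS = ("paragraph_distance", "sentence_distance", "word_distance")


def label_row(printed_row):
    """Turn one printed list (as pasted in this issue) into a dict keyed by column."""
    values = ast.literal_eval(printed_row)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    # Drop the trailing newline and the embedded double quotes around terms.
    record = {name: value.strip().strip('"') for name, value in zip(COLUMNS, values)}
    for name in DISTANCE_COLUMNS:
        record[name] = int(record[name])
    return record


# First row from the original report, kept as a raw string so '\n' stays literal.
first_row = r"""['21509038', '1303344000', '20-04-2011', '"cisplatin"', '"tp53"', '0', '1', '223', 'true', '1303344000\n']"""
record = label_row(first_row)
print(record["binary_a_term"], record["binary_b_term"],
      record["sentence_distance"], record["word_distance"])
```

Read this way, the first row has a sentence distance of 1 and a word distance of 223.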
stevencox commented 7 years ago

@cpschmitt, what you show below is likely also an issue. More fundamentally, though, word position is currently counted as the absolute position within the document, not the position within a sentence.
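As a rough illustration of the two counting schemes, the sketch below records both an absolute document position and a within-sentence offset for every token. The function `index_tokens` and the two short sentences are made up for this example; this is not the chemotext tokenizer.

```python
def index_tokens(sentences):
    """Return (token, sentence_index, offset_in_sentence, absolute_position) tuples."""
    indexed = []
    absolute = 0  # running position across the whole document
    for sentence_index, sentence in enumerate(sentences):
        for offset, token in enumerate(sentence.split()):
            indexed.append((token, sentence_index, offset, absolute))
            absolute += 1
    return indexed


# Invented two-sentence document, whitespace-tokenized for simplicity.
sentences = ["cisplatin induces apoptosis", "p53 mediates the response"]
lookup = {token: (s, o, a) for token, s, o, a in index_tokens(sentences)}

term_a = lookup["cisplatin"]
term_b = lookup["p53"]
sentence_distance = abs(term_b[0] - term_a[0])
absolute_word_distance = abs(term_b[2] - term_a[2])
print(sentence_distance, absolute_word_distance)
```

Here both terms sit at offset 0 of their own sentence, yet the absolute word distance is 3. When both terms fall in the same sentence, the two schemes give the same difference, so the absolute count grows only with the text that lies between the terms.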

The word distances are just too big, given that even really long sentences typically run no more than 30-40 words. I took a look at the sentences part of the JSON file you mention and I think I see the problem. Below are a few example sentences from the sentence section. In these, periods mark sentence boundaries but are followed by numbers (e.g., ... tc cell lines. 2,3,4,5 an important...). The numbers appear to be citation markers, and I think they are throwing off the sentence boundary detector.

"most testicular cancer (tc) patients respond well to cisplatin-based chemotherapy; however, there is still a subset of these young patients that will die because of chemo-resistant or chemo-refractory disease.1 similar to its effects in patients, cisplatin proved to be an extremely cytotoxic drug, inducing massive apoptosis in human tc cell lines.2, 3, 4, 5 an important role of p53 in the response to chemotherapeutic drugs and the execution of apoptosis has been described.6 the p53 is a tumour suppressor protein with a dual role in stress response by transactivation of genes that induce apoptosis, such as fas (tnfrsf6), as well as genes that induce cell-cycle arrest, such as cyclin-dependent kinase inhibitor 1a gene (cdkn1a), encoding p21cip1/waf1, allowing time for dna repair. ",

  "tumours that retain wild-type p53 are supposed to have other defects in the p53 pathway, such as the presence of microrna (mir)-371-373, mir-106b-seed-family members or cytoplasmic p21, the lack of phosphatase and tensin homologue (pten) expression or the increased mouse double minute 2 (mdm2) expression.16, 17, 18, 19 mdm2, as transcriptional target of p53, is the main negative feedback regulator of p53. ",

"this (re)activation leads to cell-cycle arrest and or apoptosis in tumour cells with wild-type p53.20, 21, 22, 23 restoration of p53 function by nutlin-3 may thus have profound therapeutic effect on tumours that have retained wild-type p53, particularly if mdm2 activity is disproportionally increased.23 recently, nutlin-3-induced apoptosis was investigated in a small panel of tc cell lines, and only additive effects were seen in combination with cisplatin. "
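One possible workaround, sketched here under assumptions, is to treat a period that is immediately followed by citation numbers as a sentence boundary before (or instead of) running the boundary detector. The regular expression and the function name `split_on_citation_markers` are hypothetical, not chemotext code. The pattern assumes lowercased text like the examples above, it discards the citation numbers, and it would also split on decimals such as "0.5 mg", so it would need more care on real input.

```python
import re

# A period (kept via lookbehind), then citation numbers such as "2, 3, 4, 5",
# then whitespace, then the lowercase start of the next sentence.
CITATION_AFTER_PERIOD = re.compile(r"(?<=\.)\d+(?:,\s*\d+)*\s+(?=[a-z])")


def split_on_citation_markers(text):
    """Split text wherever a period is directly followed by citation numbers."""
    pieces = CITATION_AFTER_PERIOD.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


# Shortened excerpt of the first example sentence above.
sample = ("cisplatin proved to be an extremely cytotoxic drug, inducing massive "
          "apoptosis in human tc cell lines.2, 3, 4, 5 an important role of p53 "
          "has been described.6 the p53 is a tumour suppressor protein.")

for sentence in split_on_citation_markers(sample):
    print(sentence)
```

On this excerpt the split yields three sentences instead of one, which is closer to what the word and sentence distances should be computed over.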