quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

readtext issue: ligature artifacts remain, even after specifying UTF-8 encoding. #146

Open MSU2580 opened 5 years ago

MSU2580 commented 5 years ago

After reading in a corpus of scientific work in the following manner, we eventually get an object back out (FinalSciCorp) which contains all unique words within the corpus, along with some other information:

library(quanteda)
library(readtext)

tempsci <- readtext("*.txt", encoding = "UTF-8")
sciCorp <- corpus(tempsci)
doc_term_matrix <- dfm(sciCorp, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, stem = FALSE)
FinalSciCorp <- textstat_frequency(doc_term_matrix)

However, FinalSciCorp still contains some words with ligatures such as "ff" and "fi", among others. As an example, FinalSciCorp contains both the words "field" and "<U+FB01>eld", or in another case just the word "signi<U+FB01>cant". The 'encoding' and 'stri_enc_detect' functions both indicate that the files are likely "UTF-8" although we have also tried many other options, including "latin1" for encoding.
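One possible workaround, offered as a sketch and not confirmed anywhere in this thread: the ligatures are valid UTF-8 characters (U+FB01 is a single "fi" code point), so the encoding argument alone will not split them. Unicode NFKC normalization maps such compatibility characters to their plain-letter equivalents, and stringi (already loaded in the session) provides `stri_trans_nfkc()` for this. The helper name `normalize_ligatures` and the sample data frame are made up for illustration; the data frame only mimics the `doc_id` and `text` columns that `readtext()` returns.

```r
library(stringi)

# Hypothetical helper (not part of readtext or quanteda): apply Unicode NFKC
# normalization, which maps compatibility characters such as U+FB01 to "fi".
normalize_ligatures <- function(x) {
  stri_trans_nfkc(x)
}

# Stand-in for the data frame returned by readtext(), which keeps the
# document text in a `text` column. The sample sentence is invented.
docs <- data.frame(
  doc_id = "example.txt",
  text = "A signi\ufb01cant \ufb01eld e\ufb00ect",
  stringsAsFactors = FALSE
)

# Normalize the text before passing the data frame on to corpus().
docs$text <- normalize_ligatures(docs$text)
print(docs$text)
```

With real data the same assignment would be applied to `tempsci$text` before calling `corpus(tempsci)`. Note that NFKC also rewrites other compatibility characters (superscripts, full-width forms), which may or may not be wanted for a given corpus.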

kbenoit commented 5 years ago

Can you send me a link to one of the documents containing a ligature, as well as your sessionInfo() output, so I can test it?

MSU2580 commented 5 years ago

Hello,

I apologize if you prefer another format (let me know), but the following should contain the information you desire. Note that attached is one of the documents which contains ligature issues.

The sessionInfo() data:

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] readtext_0.71 quanteda_1.3.14

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16 magrittr_1.5 stopwords_0.9.0 munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35 R6_2.2.2 rlang_0.2.0
 [9] fastmatch_1.1-0 stringr_1.3.0 httr_1.3.1 plyr_1.8.4 tools_3.4.4 grid_3.4.4 data.table_1.10.4-3 gtable_0.2.0
[17] spacyr_1.0 yaml_2.1.18 lazyeval_0.2.1 RcppParallel_4.4.2 tibble_1.4.2 Matrix_1.2-12 ggplot2_2.2.1 ISOcodes_2018.06.29
[25] stringi_1.1.7 compiler_3.4.4 pillar_1.2.1 scales_0.5.0 lubridate_1.7.4

The code I utilized (as it pertains to just the attached file):

library(quanteda)
library(readtext)

tempsci <- readtext("H0101.txt", encoding = "UTF-8")
sciCorp <- corpus(tempsci)
doc_term_matrix <- dfm(sciCorp, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, stem = FALSE)
FinalSciCorp <- textstat_frequency(doc_term_matrix)

Thank you for your time, it is most appreciated!

Best,


Keith L. Johnson, Ph. D.

Montana State University

Physics Department, Barnard Hall 226

PER Group





40-year trends in an index of survival for all cancers combined and survival adjusted for age and sex for each cancer in England and Wales, 1971–2011: a population-based study

Summary

Background Assessment of progress in cancer control at the population level is increasingly important. Population-based survival trends provide a key insight into the overall effectiveness of the health system, alongside trends in incidence and mortality. For this purpose, we aimed to provide a unique measure of cancer survival.

Methods In this observational study, we analysed trends in survival with population-based data for 7·2 million adults diagnosed with a first, primary, invasive malignancy in England and Wales during 1971–2011 and followed up to the end of 2012. We constructed a survival index for all cancers combined using data from the National Cancer Registry and the Welsh Cancer Intelligence and Surveillance Unit. The index is designed to be independent of changes in the age distribution of patients with cancer and of changes in the proportion of lethal cancers in each sex. We analysed trends in the cancer survival index at 1, 5, and 10 years after diagnosis for the selected periods 1971–72, 1980–81, 1990–91, 2000–01, 2005–06, and 2010–11. We also estimated trends in age-sex-adjusted survival for each cancer. We define the difference in net survival between the oldest (75–99 years) and youngest (15–44 years) patients as the age gap in survival. We evaluated the absolute change (%) in the age gap since 1971.

Findings The overall index of net survival increased substantially during the 40-year period 1971–2011, both in England and in Wales. For patients diagnosed in 1971–72, the index of net survival was 50% at 1 year after diagnosis. 40 years later, the same value of 50% was predicted at 10 years after diagnosis. The average 10% survival advantage for women persisted throughout this period. Predicted 10-year net survival adjusted for age and sex for patients diagnosed between 2010 and 2011 ranged from 1·1% for pancreatic cancer to 98·2% for testicular cancer. Net survival for the oldest patients (75–99 years) was persistently lower than for the youngest (15–44 years), even after adjustment for the much higher mortality from causes other than cancer in elderly people.

Interpretation These findings support substantial increases in both short-term and long-term net survival from all cancers combined in both England and Wales. The net survival index provides a convenient, single number that summarises the overall patterns of cancer survival in any one population, in each calendar period, for young and old men and women and for a wide range of cancers with very disparate survival. The persistent sex difference is partly due to a more favourable cancer distribution in women than men. The very wide differences in survival for different cancers, and the persistent age gap in survival, suggest the need for renewed efforts to improve cancer outcomes. Future monitoring of the cancer survival index will not be possible unless the current crisis of public concern about sharing of individual data for public health research can be resolved.

Introduction Cancer is an increasing public health concern, shown by substantial investments in human and financial resources for cancer management since the late 1990s. Health policy measures have focused on improvement of the organisation and delivery of services for prevention, diagnosis, and treatment. Research has provided the evidence base for these policies and is increasingly used to assess their effect.1–7 The assessment of progress in cancer control has become crucial. Population-based cancer survival trends provide a key insight into the overall effectiveness of the health system, alongside incidence and mortality.8

In this population-based survival study, we analysed cancer survival trends during the past four decades in England and Wales using two metrics: an index of survival for all cancers combined, and survival for each cancer, adjusted for age and sex. The all-cancers survival index was designed to provide one summary measure of cancer survival that can be monitored over time to show the overall progress in the effectiveness of the health-care system. It was also designed to support assessment of the effect of earlier diagnosis, which is a key component of the National Awareness and Early Diagnosis Initiative.9–11 Trends in survival for individual cancers will underline those cancer types for which there has been progress and those for which prognosis has remained poor.

Methods Study design Survival varies very widely with the age and sex of a patient with cancer and with the type of cancer. The frequency of different cancers is also changing over time: some cancers with poor prognosis, such as stomach and lung cancer, have become less common, whereas breast cancer in women, for which survival has been improving, has become more common. These trends can differ between the sexes: lung cancer has become much less common in men, but more common in women. The age profile of patients with cancer also changes over time, and these trends can differ between cancers. To enable valid assessment of survival trends for all cancers combined, the survival index must therefore take account of changes over time in the distribution of age, sex, and cancer type in all patients with cancer, especially over periods as long as 40 years. Similarly, trends in survival for each cancer must be adjusted for changes over time in the age (and sex) profile of patients with cancer.

Data sources We examined survival trends in 7 176 795 adults (aged 15–99 years) diagnosed with a first, primary, invasive malignancy in England and Wales during 1971–2011, and followed up to Dec 31, 2012 (table 1). Data for England were obtained from the National Cancer Registry at the Office for National Statistics12 and for Wales from the Welsh Cancer Intelligence and Surveillance Unit. Patients diagnosed with a malignancy of the skin other than melanoma were excluded. Since 1971, the National Health Service Central Register has routinely updated these individual cancer records with information about each patient’s vital status (alive, emigrated, dead, or not traced). The vital status at Dec 31, 2012, was known for 98·4% of these patients. During the 41-year period, 4·3% of all cancer registrations were for the patient’s second-order or higher-order tumour: in the analyses for all cancers combined, the higher-order cancers were not included.

Statistical analysis The all-cancers survival index was constructed as a weighted average of the survival estimates for every combination of age group at diagnosis (15–44, 45–54, 55–64, 65–74, and 75–99 years), sex (male and female), and type of cancer (the 21 most common malignancies are shown in table 1 and all other malignant tumours are combined). The weights used were the proportion of patients with cancer diagnosed in England and Wales during 1996–99 in each of the 185 combinations of age group, sex, and type of cancer. We also constructed the all-cancers survival index separately for males and females and estimated survival adjusted for age and sex by cancer.

Cancer | ICD-10 code* | England women, number | England women, % | England men, number | England men, % | Wales women, number | Wales women, % | Wales men, number | Wales men, %
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Oesophagus | C15 | 67 474 | 2·0% | 106 793 | 3·1% | 4953 | 2·3% | 6857 | 3·1%
Stomach | C16 | 115 294 | 3·4% | 194 333 | 5·7% | 8627 | 4·0% | 14 299 | 6·5%
Colon | C18 | 292 352 | 8·7% | 271 220 | 8·0% | 17 711 | 8·3% | 17 736 | 8·1%
Rectum | C19–C21 | 143 610 | 4·3% | 204 363 | 6·0% | 9731 | 4·5% | 14 358 | 6·6%
Pancreas | C25 | 92 631 | 2·8% | 93 450 | 2·7% | 5868 | 2·7% | 6014 | 2·7%
Larynx (men) | C32 | ·· | ·· | 52 618 | 1·5% | ·· | ·· | 3529 | 1·6%
Lung | C33, C34 | 349 711 | 10·5% | 751 958 | 22·1% | 21 027 | 9·8% | 45 601 | 20·8%
Melanoma | C43 | 97 627 | 2·9% | 72 743 | 2·1% | 5429 | 2·5% | 4372 | 2·0%
Breast (women) | C50 | 1 039 609 | 31·1% | ·· | ·· | 65 370 | 30·6% | ·· | ··
Cervix | C53 | 117 404 | 3·5% | ·· | ·· | 8272 | 3·9% | ·· | ··
Uterus | C54, C55 | 160 539 | 4·8% | ·· | ·· | 10 836 | 5·1% | ·· | ··
Ovary | C56, C57.0–7 | 172 400 | 5·2% | ·· | ·· | 11 051 | 5·2% | ·· | ··
Prostate | C61 | ·· | ·· | 638 111 | 18·8% | ·· | ·· | 41 559 | 19·0%
Testis | C62 | ·· | ·· | 48 031 | 1·4% | ·· | ·· | 2743 | 1·3%
Kidney | C64–C66, C68 | 53 197 | 1·6% | 89 986 | 2·6% | 3431 | 1·6% | 5804 | 2·6%
Bladder | C67 | 90 204 | 2·7% | 239 621 | 7·0% | 5897 | 2·8% | 15 962 | 7·3%
Brain | C71 | 41 952 | 1·3% | 59 192 | 1·7% | 2832 | 1·3% | 3786 | 1·7%
Hodgkin’s disease | C81 | 19 114 | 0·6% | 26 714 | 0·8% | 1145 | 0·5% | 1675 | 0·8%
Non-Hodgkin lymphoma | C82–C85 | 99 752 | 3·0% | 114 269 | 3·4% | 5630 | 2·6% | 6320 | 2·9%
Myeloma | C90 | 43 446 | 1·3% | 48 136 | 1·4% | 2805 | 1·3% | 3041 | 1·4%
Leukaemia | C91–C95 | 70 760 | 2·1% | 92 917 | 2·7% | 4686 | 2·2% | 6112 | 2·8%
Other cancers† | ·· | 275 408 | 8·2% | 296 794 | 8·7% | 18 624 | 8·7% | 19 369 | 8·8%
Total | ·· | 3 342 484 | 100·0% | 3 401 249 | 100·0% | 213 925 | 100·0% | 219 137 | 100·0%

*Tenth revision of the International Classification of Diseases (ICD): malignancies were initially coded according to the ICD revision in use during the year of diagnosis—ie, ICD 8 (1971–78), 9 (1979–95), or 10 (1996–). †Other cancers: all other malignant tumours are combined; they also include laryngeal cancer in women and breast cancer in men.

Table 1: Number of patients (aged 15–99 years) included in analyses in England and Wales diagnosed from 1971 to 2011 and followed up to 2012, by sex and type of malignancy

[Figure 1 panels: All adults, England; All adults, Wales; Men, England; Women, England; Men, Wales; Women, Wales. Series: 1-year survival, 5-year survival, 10-year survival, prediction. Horizontal axis: period of diagnosis (year).]

Figure 1: Trends in the index of net survival for all cancers combined, for England and for Wales: all adults (15–99 years), men, and women, selected periods during 1971–2011

Net survival was used as the cancer survival measure for each component of the indexes. Net survival quantifies the survival after taking account of death from other causes (background mortality). All patients were allocated a deprivation category defined according to their Lower Super Output Area (mean population about 1500) of residence at the time of cancer diagnosis. Life-tables were used to take account of the wide variation in background mortality by age, sex, deprivation, region, and over time. For this study, separate life-tables were created for England and Wales by single year of age, sex, deprivation category, and (in England) region of residence, for every calendar year between 1971 and 2012.13 National or regional life-tables were used for the 2·8% of patients diagnosed in England (2·6% in Wales) who could not be assigned to a specific deprivation category or (in England) region; almost all of these patients were diagnosed in the 1970s (85% in England, 55% in Wales) or 1980s (14% England, 44% Wales).

We used flexible multivariable parametric excess hazard models14,15 to estimate net survival up to 10 years after diagnosis for each nation, and for each stratum defined by cancer, sex, age group, and calendar period. The models included age and year of diagnosis as main effects, modelled on a continuous scale with restricted cubic splines, to account for potential non-linear excess (cancer-related) hazards. Interactions between age and year of diagnosis, year of diagnosis and follow-up time, and age and follow-up time were assessed to deal with potential variation of the excess hazard with time since diagnosis. The best-fitting models were chosen as those with the smallest Akaike Information Criterion.16 Net survival curves were estimated for each individual from these models according to their age and year of diagnosis. We obtained net survival estimates for each cancer and sex by averaging of individual net survival curves, over all ages and years of diagnosis within each age group and calendar period. In view of the fact that the models included the year of diagnosis as a continuous variable, we were able to predict survival up to 10 years after diagnosis, even for the patients diagnosed most recently (ie, 2010–11). All models were fitted with the STATA command stpm2 using STATA 13.1.17,18

We included all patients diagnosed during the 40 years from 1971 to 2011 in the models to estimate survival trends, but we report estimates for each cancer survival index at 1, 5, and 10 years after diagnosis only for six selected periods of diagnosis: 1971–72, 1980–81, 1990–91, 2000–01, 2005–06, and 2010–11. We define the difference in net survival between the oldest (75–99 years) and youngest (15–44 years) groups as the age gap in survival. We provide a simple summary of changes in survival by age as the absolute change (%) in the age gap since 1971. A negative value for this change means that the age gap has become wider. For Wales, reliable estimates of net survival could not be obtained for 11·5% of the age-sex-cancer combinations because

Columns: periods of diagnosis 1971–72, 1980–81, 1990–91, 2000–01, 2005–06, and 2010–11 (prediction), each with net survival at 1 year, 5 years, and 10 years after diagnosis

53·3% 54·1% 52·2%

10·6% 10·2% 11·0%

15·0% 14·7% 15·6%

15·4% 15·3% 15·5%

41·5% 42·6% 40·4%

All cancers combined 50·1% All patients Men 44·7% Women 55·5% Oesophagus All patients Men Women Stomach All patients Men Women Colon All patients Men Women Rectum All patients Men Women Pancreas All patients Men Women Larynx Men Lung All patients Men Women Melanoma of skin All patients Men Women Breast Women Cervix Women Uterus Women Ovary Women Prostate Men Testis Men Kidney All patients Men Women

81·6% 74·5% 86·7%

44·9% 45·4% 43·9%

43·7%

66·1%

81·9%

74·0%

75·6%

80·7%

16·0% 16·3% 15·4%

83·3%

29·8% 24·0% 25·2% 19·9% 27·9% 34·3%

55·8% 35·3% 50·6% 29·6% 61·0% 40·9%

28·8% 23·3% 34·1%

4·3% 4·0% 4·8%

5·2% 5·2% 5·3%

24·6% 25·3% 23·8%

24·2% 23·6% 25·0%

2·3% 2·4% 2·2%

3·5% 3·3% 3·9%

4·0% 4·0% 4·0%

22·8% 23·0% 22·6%

20·1% 19·1% 21·6%

1·2% 1·3% 1·1%

19·1% 18·5% 20·0%

20·6% 20·7% 20·5%

5·3% 4·8% 6·2%

8·2% 8·1% 8·4%

54·0% 34·2% 34·6% 55·2% 52·7% 33·8%

60·6% 61·4% 59·5%

12·1% 12·4% 11·9%

32·5% 32·0% 33·2%

2·8% 3·1% 2·4%

4·3% 3·8% 5·0%

6·7% 6·7% 6·8%

31·8% 31·5% 32·1%

28·2% 27·1% 29·6%

1·5% 1·7% 1·2%

60·6% 55·7% 65·3%

24·2% 24·1% 24·3%

26·8% 27·0% 26·5%

62·1% 63·5% 60·7%

6·5% 6·1% 7·1%

10·9% 10·6% 11·5%

41·6% 41·9% 41·3%

67·8% 42·0% 68·7% 41·7% 66·6% 42·4%

13·0% 13·5% 12·5%

2·8% 3·2% 2·4%

41·0% 34·8% 47·2%

34·4% 28·0% 40·7%

64·9% 47·4% 41·6% 60·7% 42·0% 36·0% 47·0% 69·0% 52·7%

5·1% 4·8% 5·6%

8·9% 8·6% 9·4%

38·6% 38·1% 39·0%

37·7% 36·7% 39·0%

1·5% 1·7% 1·3%

31·1% 32·5% 28·8%

8·8% 9·1% 8·2%

33·9% 14·1% 34·7% 13·9% 32·4% 14·5%

7·0% 7·3% 6·5%

11·3% 11·0% 11·8%

47·5% 44·5% 66·7% 68·1% 47·6% 43·6% 65·4% 47·5% 45·4%

74·0% 51·2% 47·1% 74·8% 51·0% 46·4% 72·8% 51·4% 48·2%

14·7% 15·3% 14·0%

2·7% 3·0% 2·4%

1·2% 1·4% 1·1%

67·6% 63·7% 71·5%

36·4% 38·3% 33·4%

37·8% 39·3% 35·2%

70·3% 71·9% 68·6%

76·7% 77·5% 75·6%

17·4% 18·1% 16·7%

50·9% 45·8% 45·8% 41·0% 56·0% 50·5%

70·5% 54·3% 49·8% 66·7% 49·2% 45·7% 74·2% 59·2% 53·8%

11·5% 12·0% 10·8%

9·3% 9·4% 9·1%

42·0% 15·3% 12·4% 44·3% 15·6% 12·0% 38·6% 14·7% 13·1%

16·3% 13·1% 16·5% 13·0% 16·1% 13·1%

41·7% 18·8% 15·0% 43·8% 19·5% 15·3% 37·9% 17·7% 14·4%

52·6% 50·3% 52·9 49·4% 52·3% 51·1%

73·9% 58·2% 56·9% 76·1% 59·2% 56·5% 71·7% 57·3% 57·4%

55·5% 51·7% 55·4% 51·0% 55·7% 52·7%

79·2% 59·7% 56·1% 79·9% 59·6% 55·5% 78·1% 59·8% 57·0%

3·0% 3·2% 2·7%

1·2% 1·2% 1·2%

20·9% 21·7% 20·2%

3·3% 3·6% 3·1%

1·1% 1·1% 1·1%

60·2%

50·4%

81·7%

62·1%

52·6%

82·8% 64·1%

54·9%

83·7% 66·0% 57·0%

84·2%

67·0% 58·2%

84·7% 67·9% 59·2%

4·6% 4·8% 4·3%

3·1% 3·2% 2·9%

18·3% 18·6% 17·8%

5·5% 5·8% 5·0%

3·7% 3·9% 3·2%

20·5% 20·4% 20·7%

6·0% 6·1% 5·9%

3·8% 3·9% 3·7%

24·4% 23·9% 25·2%

6·9% 6·6% 7·4%

4·0% 3·7% 4·5%

52·3% 46·4% 34·9% 40·5% 61·1% 54·9%

88·7% 66·4% 60·4% 56·4% 49·8% 84·5% 91·8% 73·7% 68·3%

93·1% 77·2% 90·8% 69·8% 94·9% 82·6%

71·9% 63·4% 78·2%

95·5% 83·8% 79·7% 94·0% 78·4% 73·3% 96·6% 87·8% 84·5%

28·0% 27·0% 29·7%

96·4% 95·2% 97·3%

8·0% 7·4% 9·1%

4·4% 3·8% 5·4%

32·2% 30·5% 35·1%

9·6% 8·4% 11·6%

5·0% 4·0% 6·6%

87·0% 84·4% 82·6% 79·3% 90·2% 88·3%

97·4% 90·4% 89·8% 96·6% 87·8% 86·8% 97·9% 92·4% 92·1%

52·7%

40·1%

85·9% 61·2%

48·4%

89·5%

71·1%

60·0%

92·7% 80·2%

71·6%

94·5%

83·9% 75·6%

96·0% 86·7% 78·5%

51·3% 46·0%

78·6% 58·3%

52·4%

81·6% 62·6%

57·2%

82·8% 65·4% 60·7%

82·6% 66·3% 61·9%

82·9% 67·5% 63·1%

59·0%

55·5%

79·5%

65·1%

61·5%

83·3% 69·5%

65·6%

86·9% 73·1% 69·7%

88·7%

75·9% 73·3%

90·3% 78·8% 77·4%

20·5%

17·9%

50·2%

24·9%

21·5%

57·0%

30·8%

26·4%

64·7% 38·4% 31·7%

68·8%

42·4% 33·5%

72·7% 46·4% 34·8%

36·9%

25·1%

71·5%

38·2%

24·4%

79·6% 49·6%

34·1%

89·5%

73·8% 62·4%

92·4%

81·4% 75·1%

94·0% 84·8% 83·6%

70·5% 69·2%

91·2% 84·0%

83·3%

95·8% 92·3%

91·9%

98·0% 96·3% 96·2%

98·7%

97·5% 97·4%

99·1% 98·3% 98·2%

28·5% 28·9% 28·0%

23·0% 23·0% 23·1%

51·3% 52·6% 49·1%

34·1% 35·3% 32·2%

27·6% 28·5% 26·1%

57·1% 58·7% 54·4%

39·4% 40·8% 37·1%

32·3% 33·4% 30·5%

62·8% 44·8% 37·9% 63·9% 45·2% 37·8% 60·9% 44·0% 38·0%

67·2% 68·0% 65·9%

49·8% 43·0% 50·0% 42·9% 49·4% 43·2%

72·5% 56·3% 49·6% 73·2% 56·7% 50·0% 71·3% 55·6% 48·9%

7·2% 6·6% 7·9%

56·5% 54·2% 59·4%

29·9% 29·3% 30·6%

17·7% 17·6% 17·9%

75·6% 73·9% 77·8%

39·3% 60·2% 62·8% 40·9% 53·4% 35·2%

Bladder All patients Men Women Brain All patients Men Women Hodgkin’s disease All patients Men Women Non-Hodgkin lymphoma All patients Men Women Multiple myeloma All patients Men Women Leukaemia All patients Men Women Other cancers* All patients Men Women

49·5% 49·4% 49·6%

11·8% 12·1% 11·4%

13·1% 13·1% 13·0%

37·4% 36·8% 38·0%

34·2% 35·4% 32·5%

55·3% 57·3% 53·0%

32·4% 33·7% 29·0%

5·4% 5·0% 6·0%

47·7% 45·2% 51·0%

22·0% 21·7% 22·3%

6·2% 6·8% 5·5%

6·9% 6·6% 7·2%

73·4% 56·0% 48·0% 76·0% 57·9% 49·3% 66·6% 50·8% 44·7%

52·8% 77·2% 60·8% 80·1% 54·2% 63·0% 69·6% 54·9% 49·0%

74·7% 56·4% 49·5% 78·5% 59·2% 52·0% 64·7% 49·1% 43·0%

23·3% 23·3% 23·3%

9·8% 9·2% 10·6%

7·2% 6·7% 7·8%

82·7% 66·8% 58·8% 65·1% 82·2% 56·5% 61·8% 83·3% 69·2%

27·7% 27·9% 27·4%

87·6% 87·5% 87·7%

11·8% 11·2% 12·7%

8·4% 7·9% 9·2%

30·4% 12·7% 30·9% 12·1% 29·8% 13·7%

8·8% 8·3% 9·5%

75·1% 69·2% 74·6% 68·7% 75·8% 69·9%

90·0% 80·3% 75·8% 89·7% 80·4% 75·8% 75·8% 90·3% 80·2%

58·8% 37·5% 58·6% 36·8% 59·0% 38·4%

28·1% 27·6% 28·8%

65·8% 44·9% 65·7% 44·2% 45·8% 66·0%

48·4% 47·8% 49·0%

47·3% 48·6% 45·6%

54·7% 54·3% 55·2%

17·2% 17·2% 17·1%

23·6% 23·7% 23·5%

36·5% 35·2% 37·9%

8·6% 9·0% 8·1%

14·9% 14·4% 15·6%

32·0% 30·7% 33·4%

57·4% 57·4% 57·3%

22·0% 22·2% 21·8%

57·8% 34·0% 59·4% 34·4% 55·8% 33·6%

35·2% 54·5% 52·6% 31·9% 56·6% 39·0%

35·2% 34·5% 35·9%

10·8% 11·1% 10·3%

24·0% 23·6% 24·6%

30·2% 26·9% 33·9%

70·1% 52·3% 43·9% 70·0% 51·6% 43·4% 53·2% 44·6% 70·2%

27·7%

64·5% 14·3% 65·7% 28·8% 15·1% 63·2% 26·4% 13·4%

63·8% 41·6% 32·3% 65·6% 42·4% 32·3% 61·4% 40·5% 32·2%

56·6% 37·1% 32·5% 55·0% 33·7% 29·2% 58·4% 41·0% 36·3%

73·5% 77·6% 63·0%

34·7% 35·3% 33·9%

90·8% 90·3% 91·4%

74·3% 74·4% 74·3%

70·6% 71·8% 69·3%

66·3% 68·3% 63·7%

59·7% 58·7% 60·9%

54·8% 49·2% 57·8% 52·4% 47·0% 40·9%

72·4% 53·4% 49·5% 76·6% 56·5% 53·5% 61·4% 45·3% 39·1%

15·0% 10·6% 14·2% 9·9% 16·1% 11·5%

40·1% 18·5% 13·5% 41·1% 17·8% 12·8% 38·8% 19·5% 14·5%

82·9% 78·3% 82·5% 77·2% 83·4% 79·7%

91·4% 85·0% 80·0% 90·8% 84·1% 77·7% 92·3% 86·3% 83·1%

59·7% 52·6% 59·1% 51·9% 60·5% 53·3%

79·6% 68·8% 63·1% 79·8% 68·1% 62·2% 79·4% 69·5% 64·1%

36·0% 21·4% 37·9% 23·5% 34·0% 19·2%

76·7% 47·0% 32·6% 78·0% 50·0% 36·8% 75·3% 43·8% 27·9%

46·4% 38·7% 47·7% 39·4% 44·6% 37·8%

68·6% 51·5% 46·1% 70·7% 53·3% 47·6% 65·9% 49·1% 44·2%

40·6% 36·6% 37·8% 33·9% 43·9% 39·7%

63·5% 45·2% 41·9% 63·1% 43·3% 40·1% 63·9% 47·5% 44·0%

38·4% 34·8% 40·4% 36·9% 36·2% 32·5%

*Other cancers: all other malignant tumours are combined; they also include laryngeal cancer in women and breast cancer in men.

Table 2: 40-year trends in the index of net survival for all cancers combined at 1, 5, and 10 years after diagnosis in adults (15–99 years) in England from 1971 to 2011 and trends in the age-adjusted net survival for 21 selected cancers in England from 1971 to 2011 by sex

of the small number of patients, and broader age groups were constructed to re-estimate survival for those combinations.

Role of the funding source The funder had no role in study design, quality control, analysis, interpretation of the results, drafting, or the decision to submit for publication. The corresponding author had full access to all data and was responsible for the decision to publish.

Results The index of net survival for all cancers combined at 1, 5, and 10 years since diagnosis increased substantially between 1971 and 2011 in England and Wales (figure 1, tables 2 and 3). The all-cancers survival index was 50% at 1 year after diagnosis for patients diagnosed in 1971–72. For patients diagnosed during 2005–06, the index was 50% at 5 years after diagnosis, and for patients diagnosed during 2010–11, we predict that the all-cancers survival index will reach 50% at 10 years after diagnosis.

For patients diagnosed during 2010–11, the survival index for all cancers combined had reached 69–70% at 1 year and a predicted value of 54% at 5 years for both sexes combined. The 5-year survival index rose by 24% (from 30% to 54%) and the 10-year survival index by 26% (from 24% to 50%) between the periods 1971–72 and 2010–11. Most of the increase occurred between 1990 and 2011.

The survival index for all cancers combined is on average 10% higher for women than for men at each time interval since diagnosis. The pattern of increase in the index was fairly similar for both men and women during the whole period, although the increase was linear for women but it became steeper for men after 1990–91. For patients diagnosed during 2010–11, the all-cancers survival index for women in England was 74% at 1 year, 59% at 5 years, and 54% at 10 years, whereas the figures for men were 67% at 1 year, 49% at 5 years, and

Columns: periods of diagnosis 1971–72, 1980–81, 1990–91, 2000–01, 2005–06, and 2010–11 (prediction), each with net survival at 1 year, 5 years, and 10 years after diagnosis

48·1% 28·9% 23·4% 42·9% 24·8% 20·4% 53·2% 32·9% 26·3%

53·6% 34·7% 48·2% 28·8% 58·9% 40·5%

28·9% 23·7% 34·1%

58·4% 40·4% 34·6% 53·2% 33·9% 28·1% 63·4% 46·8% 41·1%

63·2% 46·9% 41·6% 59·1% 41·5% 35·9% 67·2% 52·2% 47·2%

66·3% 50·6% 46·0% 62·7% 45·7% 41·0% 69·9% 55·5% 51·0%

69·4% 54·2% 65·9% 49·2% 72·8% 59·0%

50·2% 45·5% 54·8%

9·5% 8·7% 10·8%

15·2% 15·3% 15·0%

16·9% 17·9% 15·2%

All cancers combined All patients Men Women Oesophagus All patients Men Women Stomach All patients Men Women Colon All patients Men Women Rectum All patients Men Women Pancreas All patients Men Women Larynx Men Lung All patients Men Women Melanoma of skin All patients Men Women Breast Women Cervix Women Uterus Women Ovary Women Prostate Men Testis Men Kidney All patients Men Women

15·6% 14·6% 17·4%

5·2% 5·1% 5·4%

5·7% 5·6% 6·0%

4·1% 3·8% 4·8%

4·6% 4·5% 4·9%

42·7% 25·0% 22·8% 43·1% 26·5% 24·5% 42·2% 23·4% 21·2%

50·8% 22·9% 19·7% 50·6% 21·4% 17·9% 51·0% 25·2% 22·1%

18·7% 19·1% 18·1%

6·0% 5·8% 6·4%

10·1% 21·3% 9·7% 21·0% 21·9% 10·8%

51·8% 33·3% 51·9% 33·3% 51·8% 33·3%

58·5% 31·2% 58·7% 29·9% 58·1% 33·1%

12·2% 11·5% 12·9%

3·8% 4·0% 3·7%

2·4% 2·7% 2·1%

12·8% 13·0% 12·5%

4·6% 5·6% 3·7%

5·2% 4·9% 5·6%

8·9% 8·6% 9·4%

30·9% 30·9% 31·0%

27·7% 26·1% 30·0%

3·4% 4·6% 2·3%

22·8% 23·2% 22·1%

6·9% 7·0% 6·8%

24·7% 10·8% 25·0% 10·6% 11·2% 24·1%

5·8% 6·0% 5·5%

9·2% 9·1% 9·3%

30·7% 32·8% 27·4%

8·8% 9·3% 7·9%

6·7% 7·1% 6·1%

35·5% 10·6% 37·7% 10·9% 32·1% 10·3%

7·9% 7·8% 8·0%

39·7% 42·3% 35·8%

12·9% 12·7% 13·3%

30·9% 12·6% 32·3% 12·6% 12·7% 28·5%

9·9% 9·9% 10·1%

36·5% 15·5% 12·0% 38·2% 15·5% 11·7% 33·5% 15·6% 12·4%

19·5% 19·4%

14·9% 43·1% 14·4% 45·0% 39·6% 19·6% 16·0%

58·4% 39·8% 37·2% 60·0% 40·3% 37·4% 56·9% 39·2% 37·0%

63·2% 45·2% 42·4% 65·8% 46·6% 43·3% 60·5% 43·7% 41·6%

67·8% 50·9% 48·3% 70·2% 51·8% 48·5% 65·4% 49·9% 48·0%

73·0% 74·9% 71·1%

57·7% 55·4% 57·9% 54·9% 57·5% 55·8%

65·7% 40·6% 37·1% 66·5% 39·8% 35·9% 64·6% 41·7% 38·9%

72·4% 50·0% 46·7% 73·2% 49·5% 45·8% 71·4% 50·8% 48·0%

75·2% 54·4% 51·3% 76·1% 54·1% 50·6% 74·0% 54·8% 52·3%

55·6% 77·7% 58·5% 78·6% 58·4% 55·1% 76·4% 58·6% 56·4%

12·9% 13·5% 12·4%

4·2% 5·0% 3·3%

2·8% 3·7% 2·0%

14·0% 14·8% 13·3%

3·0% 3·4% 2·6%

1·5% 1·8% 1·3%

16·3% 16·7% 15·8%

3·0% 3·4% 2·7%

1·3% 1·5% 1·2%

19·0% 19·4% 18·6%

3·3% 3·7% 2·9%

1·2% 1·4% 1·1%

77·7% 56·3% 45·9%

82·5% 64·8%

55·6%

82·1% 63·9% 54·5%

80·2% 60·4% 50·4%

81·4% 63·3% 53·7%

84·0% 68·1%

59·5%

5·1% 4·2% 6·6%

3·6% 2·8% 5·1%

18·7% 18·6% 18·8%

7·2% 7·2% 7·0%

5·5% 5·6% 5·3%

19·7% 19·5% 20·1%

6·8% 6·7% 6·9%

4·7% 4·6% 4·9%

21·5% 21·1% 22·2%

5·9% 5·5% 6·6%

3·3% 2·9% 4·0%

25·5% 24·4% 27·4%

6·9% 6·3% 8·0%

3·6% 3·1% 4·3%

31·1% 28·8% 35·2%

8·6% 7·7% 10·3%

4·2% 3·7% 5·1%

79·9% 51·1% 44·0% 73·8% 38·9% 33·3% 84·4% 60·1% 52·0%

74·9% 47·9% 34·8%

73·9% 52·8% 47·4%

72·7% 55·9% 53·4%

48·2% 22·2% 18·0%

62·7% 36·6% 27·8%

82·9% 69·5% 66·2%

43·7% 29·0% 24·4% 44·8% 30·6% 25·3% 41·9% 26·4% 22·9%

82·3% 63·1% 76·6% 51·0% 86·5%

57·2% 44·6% 72·0% 66·6%

85·6% 71·4% 66·3% 81·8% 62·5% 55·9% 88·3% 78·0% 73·9%

77·5%

91·3% 72·9% 89·4% 71·0% 65·8% 92·7% 82·2% 78·1%

94·4% 82·4% 77·6% 93·1% 76·4% 68·9% 95·3% 86·7% 84·1%

96·8% 89·0% 82·1% 95·8% 83·7% 68·3% 97·6% 92·9% 92·2%

81·8% 60·3%

48·5%

87·4% 71·7% 62·3%

91·4% 80·4% 73·4%

93·0% 83·8% 77·9%

94·3% 86·7%

81·8%

80·0% 63·2%

57·8%

78·6% 59·9% 55·0%

78·5% 59·9% 55·2%

79·7% 62·4% 57·5%

81·7%

65·6% 60·3%

76·2% 61·7%

56·8%

80·6% 67·0% 62·2%

85·3%

72·4% 69·6%

88·1% 76·8% 73·9%

90·5%

81·2%

77·8%

52·0% 26·2%

21·8%

56·9% 31·4% 26·6%

61·1% 36·6% 31·8%

63·1% 39·2% 34·4%

65·1%

41·9%

37·1%

65·9% 35·9%

25·6%

72·9% 44·6% 32·9%

85·0% 68·8% 59·1%

90·1% 79·8% 74·9%

93·7%

87·1%

87·1%

89·9% 81·1%

80·0%

94·4% 89·7% 89·1%

97·1% 95·0% 94·1%

97·4% 96·0% 94·4%

97·4% 96·6% 93·9%

46·9% 31·0% 48·5% 32·0% 44·2% 29·5%

25·1% 25·6% 24·4%

53·0% 36·5% 29·7% 54·6% 37·2% 30·0% 50·3% 35·4% 29·2%

61·6% 46·2% 39·5% 62·3% 46·9% 39·9% 60·5% 45·1% 38·7%

66·6% 51·2% 44·0% 67·6% 51·4% 43·5% 64·8% 50·8% 44·8%

70·8% 72·2% 68·5%

55·2% 47·3% 53·9% 44·2% 57·3% 52·4%

*Other cancers: all other malignant tumours are combined; they also include laryngeal cancer in women and breast cancer in men

Table 3: 40-year trends in the index of net survival for all cancers combined at 1, 5, and 10 years after diagnosis in adults (15–99 years) in Wales from 1971 to 2011 and trends in the age-adjusted net survival for 21 selected cancers in Wales from 1971 to 2011 by sex

46% at 10 years. Both the levels and the trends in the all-cancers survival index were similar in England and Wales. The average absolute difference between the two countries was less than 1% (figure 1, tables 2 and 3).

Survival for both sexes combined varied widely for different cancers, with the most recent predicted 10-year net survival adjusted for age and sex ranging from only 1·1% for pancreatic cancer to 98·2% for testicular cancer. A scatter-plot of the 1-year, 5-year, and 10-year survival estimates for adults diagnosed in 2010–11 against the absolute change since 1971 enables three broad clusters of cancers to be identified (figure 2). The first cluster consists of cancers with high survival in 2010–11 for which the absolute increase in survival since 1971–72 is progressively larger for survival at 1, 5, and 10 years. It includes cancers of the breast, prostate, testis, and uterus, and melanoma and Hodgkin’s disease.

The second cluster is of cancers with a moderate level of survival (64–84%) in 2010–11 and, generally, smaller increases since 1971–72. This cluster consists of cancers of the larynx, cervix, rectum, colon, bladder, ovary, and kidney, with non-Hodgkin lymphoma, multiple myeloma, and leukaemia. For multiple myeloma and leukaemia, age-adjusted 10-year survival rose by more than 22% between the periods 1990–91 and 2010–11, from around 10·8% to a predicted 32·6% for multiple myeloma and from 24·0% to 46·1% for leukaemia (table 2).

The third cluster is of cancers for which survival for patients diagnosed during 2010–11 is still low, and for which little or no improvement has occurred in the past 40 years: this group consists of malignancies of the brain, stomach, lung, oesophagus, and pancreas.

This clustering can be seen as early as 1 year after diagnosis, and each cancer is in the same cluster, irrespective of the time since diagnosis (and the nation). We observed the largest absolute change in the age-adjusted survival for multiple myeloma, leukaemia, and prostate cancer.

[Figure 2 plot labels: panels for England and Wales; each point is labelled with a cancer name and the points are grouped into labelled clusters; x-axis: Absolute change since 1971 (%)]

1-year survival from lung cancer has improved substantially, from 16% in 1971–72 to 32% in 2010–11. However, estimated long-term survival for patients diagnosed in 2010–11 is very poor for both sexes: as low as 10% at 5 years and 4% and 7% in men and women, respectively, at 10 years. This overall pattern of no improvement in long-term survival is common in the cluster of poor-prognosis cancers (oesophagus, stomach, pancreas, and brain), for men and women and for both England and Wales.

Survival for breast cancer has seen a rapid and substantial improvement during the past 40 years. 5-year survival increased from 53% in 1971–72 to a predicted value of 87% in 2010–11. After 10 years, survival rose from 40% in 1971–72 to a predicted 78% for patients diagnosed during 2010–11. Diff erences between 5-year and 10-year survival estimates remained broadly constant since 1971, showing that most of the improvements in long-term survival arose in the fi rst 5 years after diagnosis. Breast cancer accounted for nearly a third of all cancers in women, which partly explains the higher all-cancers survival index in women than in men.

Although survival from cancers of the colon and rectum is much lower than survival from breast cancer (around 20% lower in 2010–11), the trends in 1-year, 5-year, and 10-year survival for these two cancers have followed an almost identical pattern to that of breast cancer during the past 40 years.

For men diagnosed with prostate cancer during 2010–11, the predicted values for 5-year and 10-year estimates are almost identical at 85% and 84%, respectively, which are huge increases from the values of 37% and 25% for men diagnosed 40 years ago. The trends are quite distinct for short-term, medium-term, and long-term survival. In both England and Wales, 1-year survival has been increasing since 1971–72, whereas acceleration in 5-year survival started for men diagnosed in the 1980s; 10-year survival only began increasing for men diagnosed in the 1990s.

For women diagnosed with cancer of the ovary during 2010–11, the age-adjusted survival was predicted as 46% at 5 years and 35% at 10 years compared with 20% and 18%, respectively, for women diagnosed during 1971–72. These results suggest that the underlying increase in survival of up to 5 years is likely to continue.

Net survival is generally lower for the oldest patients (75–99 years) than the youngest (15–44 years), even though net survival accounts for a higher mortality from causes other than cancer in elderly patients. This fi nding is shown by a scatter-plot of the age gap in net survival at 1, 5, and 10 years after diagnosis for adults diagnosed in 2010–11 against the absolute change since 1971–72: it shows a negative gap in survival for most cancers (y-axis of fi gures 3 and 4).

Figure 2: Net survival adjusted for age and sex for each cancer in 2010–11, and absolute change since 1971, all adults (15–99 years), England and Wales: 1, 5, and 10 years after diagnosis. The absolute change is the simple arithmetic diff erence between net survival in 2010–11 and the survival in 1971–72. NHL=non-Hodgkin lymphoma.

[Figure 3 plot labels: panels for England and Wales; each point is labelled with a cancer name; x-axis: Absolute change in age gap since 1971 (%), with the directions marked "Increase in age gap" and "Decrease in age gap"]

The largest age gaps in survival in men were observed for cancers for which high-dose chemotherapy is the key treatment (lymphoma, multiple myeloma, and leukaemia), but we could not identify any overall temporal patterns. For women, the largest age gaps were noted for brain tumours, and cancers of the ovary and cervix, and multiple myeloma, but the clustering was less obvious than in men. The age gap tended to narrow for melanoma and cancer of the uterus in women but widened for long-term survival of ovarian cancer.

Discussion The index of net survival for all cancers combined has increased substantially: for patients diagnosed in 1971–72, the index was 50% at 1 year after diagnosis. Our prediction is that, for patients diagnosed during 2010–11, the all-cancers survival index will reach 50% at 10 years after diagnosis. Very similar patterns of change and levels of survival were noted in both England and Wales. Survival has increased steadily during the 40 years since 1971, with a slight acceleration in the past 10–15 years, particularly for 5-year and 10-year survival, in both England and Wales. After implementation of the NHS cancer plan for England,19 we reported a slight acceleration in the 1-year cancer survival trends during 2004–06, by contrast with Wales,2 where a national cancer plan was only introduced in 2006. The pattern was not so clear for survival at 3 years after diagnosis. The fi ndings reported here suggest a continuing acceleration of these trends for longer-term survival between 2005–06 and 2010–11 in England, but also in Wales (panel).

The completeness and quality of cancer registration and follow-up data in both England and Wales have been systematically assessed and are thought to be very high throughout the period 1971–2011, despite undeniable improvement during the 1970s–80s.21–23 This improvement cannot explain long-term trends in cancer survival.24,25 Furthermore, with the exception of bladder cancer, overall changes in disease defi nitions are limited, even for haemopoietic malignancies. To aff ect the survival index, such a change in disease defi nition would need to aff ect a substantial proportion of all cancers, for which prognosis would also need to be very diff erent from that for other cancers. These conditions are not met.

In some strata defi ned by age, sex, cancer, and calendar period of diagnosis, especially in Wales, few deaths occurred. To obtain more stable net survival estimates, we therefore estimated net survival using a modelling approach rather than the non-parametric Pohar-Perme approach.20

Figure 3: Age gap in net survival by cancer, men (15–99 years) diagnosed during 2010–11 versus absolute change† in the age gap since 1971, England and Wales: 1, 5, and 10 years after diagnosis. The age gap represents the absolute diff erence (%) between net survival in the oldest (75–99 years) and youngest (15–45 years) groups of patients; a negative value means that survival is lower in the oldest group than the youngest group. †The absolute change is the simple arithmetic diff erence between the age gap in 2010–11 and the age gap in 1971–72. NHL=non-Hodgkin lymphoma.

[Figure 4 plot labels: panels for England and Wales; each point is labelled with a cancer name; x-axis: Absolute change in age gap since 1971 (%), with the directions marked "Increase in age gap" and "Decrease in age gap"]

The index of net survival for all cancers combined provides one convenient number that summarises the overall patterns of cancer survival in any one population or country, in each calendar period for young and old men and women and for a wide range of cancers with very disparate survival. The index is unaff ected by changes in the proportion of cancers of diff erent lethality in either sex, such as the reduction of lung cancer or the increase in prostate cancer in men. Similarly, the index is unaff ected by ageing of the population of patients with cancer or shifts in the proportion of any cancer between men and women. The value of the index changes only when survival for one or more cancers changes, for one or more age groups. The index therefore shows overall progress in cancer management, whether from earlier diagnosis, or earlier stage of disease, or improved treatment and care.

However, the all-cancers survival index needs careful interpretation: for example, the predicted value of 50% for the 10-year all-cancers survival index for 2010–11 does not mean that half of all patients will be cured or “beat cancer”, as has been portrayed in the media.26 The index is designed as a public health measure that summarises cancer survival trends in an entire population, to help to assess progress in the overall eff ectiveness of the health system in diagnosis and management of patients with cancer. The index does not refl ect the prospects of survival for any individual patients with cancer. The index is based on net survival, which is an unbiased measure of population- based survival from cancer after adjustment for other causes of death. Net survival is the most valid available metric for comparison of survival between populations and for assessment of progress in cancer survival over time. The all-cancers net survival index should nevertheless be interpreted in conjunction with other information available in the population or country for which the index has been prepared. It should be seen as a guide to raise questions about the potential for improvement.

The average 10% diff erence in the survival index between men and women has been a consistent feature for 40 years. It arises because, for several individual cancers, survival is slightly higher for women, but mostly because the cancers that are most common in women, such as breast cancer (weight of 0·31 in the survival index for women), generally have higher survival than the cancers that are most common in men, such as lung

Figure 4: Age gap in net survival by cancer, women (15–99 years) diagnosed during 2010–11 versus absolute change† (%) in the age gap since 1971, England and Wales: 1, 5, and 10 years after diagnosis The age gap represents the absolute diff erence (%) between net survival in the oldest (75–99 years) and youngest (15–45 years) groups of patients; a negative value means that survival is lower in the oldest group than the youngest group. †The absolute change is the simple arithmetic diff erence between the age gap in 2010–11 and the age gap in 1971–72. NHL=non-Hodgkin lymphoma.


See Online for appendix

Panel: Research in context

Systematic review Health policy measures to improve the organisation and delivery of services for the prevention, diagnosis, and treatment of cancer should be based on sound evidence. Population-based survival trends have proved to be a key metric for the overall eff ectiveness of health systems. An unbiased estimator of net survival was introduced in 2012.20 We have not undertaken a literature review, but so far, only a few countries have published population-based cancer survival using this estimator, including in England by our research group.12 No other country has constructed a single, summary index of net survival for all cancers combined. A simple, robust, one-number index of net survival for all cancers combined can contribute to the evidence base for rational health policy.

Interpretation Changes in the net survival index refl ect changes in survival for one or more cancers, not simply changes in the distribution of cancer patients by age, cancer site, or sex. The net survival index increased substantially between 1971 and 2011, representing a substantial gain in overall survival from all cancers combined. Net survival varied widely for diff erent cancers, and was generally lower for older patients than younger patients, even after adjustment for the higher mortality from other causes in older patients. Three clusters of cancers, with high, moderate, and low survival, can be distinguished as early as 1 year after diagnosis. Overall, the survival trends are encouraging in both England and Wales, but they also suggest strongly the need for renewed eff orts to achieve better outcomes.

cancer (weight of 0·22 in the index for men). The slight narrowing in the sex gap observed in the most recent periods might be explained by the rapid increase in survival for prostate cancer (weight of 0·19 in the index for men), particularly at 5 and 10 years after diagnosis. This rapid increase in survival for prostate cancer has been largely attributed to the widespread use of prostate-specifi c antigen (PSA) testing, resulting in the diagnosis of many less advanced tumours with a shift of the stage distribution to less advanced and less aggressive disease. However, importantly, survival had already started to increase, albeit more slowly, much before PSA testing was widely used.27 The more recent increase in long-term survival suggests that this improvement is not simply because of a shift in the stage distribution after increasingly wide use of the PSA test. The increase in short-term survival, which began as early as the 1970s, and the increase in 5-year survival in the 1980s and then in the 10-year survival in the following decade cannot simply be attributed to PSA.

We were able to group the 21 most common cancers into three clusters on the basis of their survival. Despite some large gains in survival, these clusters are, with few exceptions, the same in 2011 as in 1971 (data not shown).

The clusters are identifi able as early as 1 year after diagnosis, and they are consistent at 5 and 10 years after diagnosis, both in England and Wales.

Cluster 1 includes cancers with a good prognosis: survival is now very high, after a large increase since 1971, particularly at 5 and 10 years after diagnosis. 1-year survival seems to have reached a ceiling for most of these cancers, but survival at 5 and 10 years is still much lower than at 1 year for breast cancer and Hodgkin’s disease. The absence of any plateau in survival, even 10 years after diagnosis, shows that cure at the population level has still not been reached for these cancers, leaving room for substantial further improvement in long-term survival.

For most cancers in the other two clusters, survival at 5 and 10 years after diagnosis is still much lower than 1-year survival. The second cluster consists of a further mix of cancers for which either survival has remained moderate since the early 1970s, or moderate levels of survival in 2011 are the result of large improvements during the past 40 years. The second situation is well illustrated by the steep increase in survival from multiple myeloma since 2000–01, probably explained by the introduction of higher-dose treatment regimens around

  1. For the cancers in this cluster that have shown no evidence of improvement, eff orts should be made to achieve earlier diagnosis, and to focus on stricter guidelines for improved treatment, such as increased use of surgery, radiotherapy with curative intent, neoadjuvant therapies, or a combination of the three.

The eff ect of mass-screening on survival varies with the cancer. For cervical cancer, an effi cient screening programme does not necessarily lead to an improvement in survival because screening prevents the occurrence of invasive tumours, thereby reducing incidence, and the remaining patients are, on average, diagnosed with more advanced disease.28 A quasi-plateau in 1-year survival has been observed since 2000–01 (appendix 1 and 2).

By contrast, breast cancer screening aims to diagnose the disease at an early stage, rather than to prevent it. Its real eff ect on survival has been questioned mainly because of possible overdiagnosis and lead time. However, overdiagnosis does not exceed a few percent,29 and the advantage in survival remains important for screen- detected breast cancer after accounting for lead time.30 Improvement in breast cancer survival has been large because of both early diagnosis and improved treatment, although net survival continues to decrease even 10 years after diagnosis, showing late recurrences. The age gap in survival has also decreased, supporting more rapid improvement in survival for older women (and for the screened age group) than in younger women.31

Screening for colorectal cancer, which started in 2006, aims to prevent invasive malignant tumours (by removing polyps with adenomatous change) and to diagnose cancer at an early stage. Therefore, although it is too recent to have any eff ect on these results, lessons from both cervical and breast cancer screening programmes will also help us to monitor the eff ect of screening on the prognosis of colorectal cancer.

A wide age gap in survival was still present for most cancers in 2010–11. Some of these diff erences are related to screening or early diagnostic practices (breast, cervix, prostate). Also, the disease, and its prognosis, might radically diff er by age, such as leukaemia: the treatment of acute disease in young patients improved substantially, by contrast with chronic leukaemia in elderly patients, but separation of both diseases is not possible over the entire period 1971–2011. However, in other countries, the age gap in cancer survival is much narrower than in England and Wales.32,33 The wide age-related inequalities in cancer survival in England and Wales are thus likely to be avoidable. They could be substantially reduced.

1-year survival has improved substantially for cancers with a particularly poor prognosis (cluster 3), but longer- term survival (5 and 10 years after diagnosis) has hardly changed during the past four decades. Among these cancers, substantial improvements should be achievable for lung cancer: in 2011, National Institute for Health and Care Excellence (NICE) guidelines34 underlined the need for improved staging and increased widespread access to surgery and radiotherapy with curative intent for non- small-cell lung cancer. Adherence to these guidelines and their eff ect on cancer outcomes has not yet been exhaustively assessed.35

In summary, despite impressive overall improvements in cancer survival during the past 40 years in both England and Wales, the wide and persistent diff erences in survival between cancers, together with the wide and persistent age gap in survival for most cancers, suggest the need for renewed eff orts to achieve improved outcomes, particularly in elderly patients. The fi ndings reported here off er clues for focused research to dissect the underlying causes of these diff erences in cancer survival. The results should prompt action to improve public health in both England and Wales. This research will need systematic linkage of clinical audit streams and other detailed data streams to population-based cancer registry data, but the recent crisis of public concern about the sharing of individual health data for confi dential public health research will need to be resolved fi rst.36 Contributors MQ did the analysis. MQ and BR designed the analytic strategies and constructed the indexes. MQ, MPC, and BR wrote the Article and interpreted the findings.

kbenoit commented 5 years ago

Thanks - that will of course read the ligatures because they are part of your text file. readtext does not convert ligatures if the .txt file contains them.

But if you are converting from a pdf file, then we can help. My own tests showed that ligatures are correctly processed. Did you read the text above using readtext() where the source file was a pdf?
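If it helps to confirm what is actually in a file before deciding how to handle it, here is a rough sketch for counting ligature characters in a character vector. The sample strings are made up for illustration, and the pattern assumes the ligatures of interest are the Latin ones in the Unicode range U+FB00 to U+FB06.

```r
library(stringi)

# Hypothetical sample text: "\ufb01" is the "fi" ligature, "\ufb00" the "ff" ligature.
x <- c("signi\ufb01cant \ufb01eld", "no ligatures here", "di\ufb00erence")

# The Latin ligatures sit in the range U+FB00..U+FB06.
lig_pattern <- "[\\uFB00-\\uFB06]"

# Number of ligature characters in each element of x.
lig_counts <- stri_count_regex(x, lig_pattern)
lig_counts

# Which ligature characters occur, and how often, across all elements.
table(unlist(stri_extract_all_regex(x, lig_pattern, omit_no_match = TRUE)))
```

A non-zero count for a .txt file would suggest the ligatures were written into the file when it was created, rather than introduced when reading it.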

MSU2580 commented 5 years ago

Hello,

Specifying the UTF-8 encoding did help clear up many of the other issues we had when reading in the .txt file. However, I will see if we have access to the pdf files and go from there. I did utilize your work on the example pdf document and noted that it was effective, which may be the route we have to take. Thank you again.

Best,


Keith L. Johnson, Ph. D.

Montana State University

Physics Department, Barnard Hall 226

PER Group



kbenoit commented 5 years ago

One option for you is "Unicode normalization", which converts each ligature into its component letters. (But I am not sure why there is a space after each ligature in the text above. If that space is actually part of your file, it would need to be dealt with separately.)

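# "\ufb01", "\ufb00" and "\ufb04" are the fi, ff and ffl ligature characters
txt <- "substantial investments in human and \ufb01nancial
resources for the e\ufb00ect of earlier diagnosis of wa\ufb04es."

# NFKC normalization replaces each ligature with its component letters
cat(stringi::stri_trans_nfkc(txt))
## substantial investments in human and financial
## resources for the effect of earlier diagnosis of waffles.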
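
For the stray space after each ligature, one possible approach is to drop the space before normalizing. This is only a sketch: the input string is invented to mimic the pasted text, and it assumes the file still contains the ligature characters themselves (not already-expanded letters).

```r
library(stringi)

# Hypothetical input mimicking the pasted text: a ligature followed by a stray space.
txt <- "the di\ufb00 erence in \ufb01 gure 1 was signi\ufb01 cant"

# Drop a single space that directly follows a ligature and precedes a lowercase letter.
# Caveat: this also joins a word that legitimately ends in a ligature to the next
# word (e.g. "o\ufb00 the"), so the result is worth checking against a word list.
joined <- stri_replace_all_regex(txt, "([\\uFB00-\\uFB06]) (?=\\p{Ll})", "$1")

# NFKC normalization then expands each ligature into plain letters.
clean <- stri_trans_nfkc(joined)
cat(clean)
```

The order matters here: once the ligatures have been expanded, there is no longer a reliable marker for where the unwanted spaces are.
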
MSU2580 commented 5 years ago

Dr. Benoit

Since we last communicated, I have come across some pdf files (one example attached) which also seem to contain ligatures for "fi", "fl", etc. (as well as symbols such as lambda, which do not concern us). I have again attached all the relevant data that you requested before. If you would like me to open a separate issue on GitHub, let me know. Thank you for all your insight!

The sessionInfo() data:

R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] readtext_0.71 quanteda_1.3.14

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16 magrittr_1.5 stopwords_0.9.0 munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35 R6_2.2.2 rlang_0.2.0
 [9] fastmatch_1.1-0 stringr_1.3.0 httr_1.3.1 plyr_1.8.4 pdftools_2.1 tools_3.4.4 grid_3.4.4 data.table_1.10.4-3
[17] gtable_0.2.0 spacyr_1.0 yaml_2.1.18 lazyeval_0.2.1 RcppParallel_4.4.2 tibble_1.4.2 Matrix_1.2-12 ggplot2_2.2.1
[25] ISOcodes_2018.06.29 stringi_1.1.7 compiler_3.4.4 pillar_1.2.1 scales_0.5.0 lubridate_1.7.4

The code I utilized (as it pertains to just the attached file):

library(quanteda)
library(readtext)

# read the pdf, build the corpus, then count term frequencies
tempsci <- readtext("hep-th9910196.pdf")
sciCorp <- corpus(tempsci)
doc_term_matrix <- dfm(sciCorp, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, stem = FALSE)
FinalSciCorp <- textstat_frequency(doc_term_matrix)

Best,


Keith L. Johnson, Ph. D.

Montana State University

Physics Department, Barnard Hall 226

PER Group



kbenoit commented 5 years ago

Thanks, but please either upload them by dragging them into the GitHub browser window, or send them to me by email. (They did not show up above.)

MSU2580 commented 5 years ago

hep-th9910196.pdf

kbenoit commented 5 years ago

Thanks. That definitely contains the ligatures, and readtext::readtext() definitely does not normalize them. We can think about adding an option to readtext to do this automatically, but in the meantime, you can solve this "manually" using stringi:


library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
rtxt <- readtext::readtext("~/Downloads/hep-th9910196.pdf")

# as read from the pdf: the matched keywords still contain ligature characters
texts(rtxt) %>%
  kwic(c("in*nity", "*erential", "coe*cients", "*uctuation", "cuto*"), 1)
##
##  [hep-th9910196.pdf, 1523] the |   inﬁnity   | so
##  [hep-th9910196.pdf, 1614] the | ﬂuctuation  | (
##  [hep-th9910196.pdf, 1812] the | diﬀerential | equation
##  [hep-th9910196.pdf, 1826] the | coeﬃcients  | of
##  [hep-th9910196.pdf, 1987] the |   inﬁnity   | ,
##  [hep-th9910196.pdf, 2928] the |   inﬁnity   | r
##  [hep-th9910196.pdf, 3070] the |   inﬁnity   | ,
##  [hep-th9910196.pdf, 3760] are |   cutoﬀ     | for

# after NFKC normalization: the ligatures are replaced by plain letters
texts(rtxt) %>%
  stringi::stri_trans_nfkc() %>%
  kwic(c("in*nity", "*erential", "coe*cients", "*uctuation", "cuto*"), 1)
##
##  [text1, 1523] the |   infinity   | so
##  [text1, 1614] the | fluctuation  | (
##  [text1, 1812] the | differential | equation
##  [text1, 1826] the | coefficients | of
##  [text1, 1987] the |   infinity   | ,
##  [text1, 2928] the |   infinity   | r
##  [text1, 3070] the |   infinity   | ,
##  [text1, 3760] are |    cutoff    | for
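
Note that the document name becomes "text1" in the second output because the normalized character vector no longer carries the file name. A possible way around that, sketched here under the assumption that the object returned by readtext() behaves like a data frame with `doc_id` and `text` columns, is to normalize the `text` column in place. The data frame below is a hand-made stand-in so the snippet runs without the pdf.

```r
library(stringi)

# Stand-in for a readtext object: a data frame with doc_id and text columns.
rtxt <- data.frame(
  doc_id = "hep-th9910196.pdf",
  text = "the in\ufb01nity cuto\ufb00 and the coe\ufb03cients",
  stringsAsFactors = FALSE
)

# Normalize the text column in place so that doc_id stays attached to the text.
rtxt$text <- stri_trans_nfkc(rtxt$text)
rtxt$text
```

With a real readtext object, the same assignment followed by corpus(rtxt) should keep the original file names as document names, although I have only checked it on the stand-in above.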